Group Members: Christian Marcelo Chafla Bastidas r0874332, Anja Derić r0873512, Theodore Gautier r0814273, Peter Day r0866276 Submission Date: 20/6/2022
Research Question
Research Design
Analysis & Results
Conclusions
Pitch
References
This project focuses on the issue of food deserts and lack of access to grocery stores in the United States of America. To that end, the research questions this project aims to answer are as follows:
Is there a relationship between demographic variables, availability of grocery stores, and prevalence of health-related issues in the US?
Is number of grocery stores per square mile related to obesity when controlling for a number of other variables?
Do grocery store and fast food restaurant reviews and ratings vary between areas in the US which are considered food deserts and those which are not?
These research questions are relevant because they can show how the heterogeneous distribution of grocery stores can lead to a number of health problems and disproportionately affects poor, less-educated, and black people.
As indicated by the research questions outlined above, our design consisted of two different approaches/segments of data. The first part of our research focuses on using demographic, economic, and similar administrative data to explore what factors impact the development and existence of food deserts, as well as how living in food desert areas can impact individual health.
The second segment of our research focuses more on attitudes of individuals living within food desert and non-desert areas, specifically looking at Yelp review data and comparing reviews between the two areas for grocery stores as well as fast food restaurants.
This design setup allows us to get a broad and diverse overview of the impact of food deserts in the United States. The next section of this report will cover our data collection process, descriptive statistics, unsupervised and supervised learning, as well as API data analysis and sentiment analysis.
Note: Some segments of code are suppressed from the output so as to make the report more clear and concise. All code segments are included in the final .Rmd file, but not necessarily this rendered report.
The data used for descriptive analytics, unsupervised, and supervised learning comes from the following sources:
Grocery Store Data National Neighborhood Data Archive (NaNDA): Grocery Stores by ZIP Code Tabulation Area, United States, 2003-2017
Demographic Data US Census: American Community Survey (ACS)
Health Data US CDC: PLACES: Local Data for Better Health, ZCTA Data 2021 release
All data are estimates for United States ZIP Code Tabulation Areas (ZCTAs), which we used to join the various datasets. In cases of missing data we generally removed the observations rather than imputing: some fields, such as zip code, make no sense to impute, and with roughly 28,000 ZCTAs remaining after removal we were not short of data.
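As a minimal sketch of this joining step (with hypothetical toy tables standing in for the real files), the datasets can be combined on the shared `zcta` key and rows with missing values dropped rather than imputed:

```r
library(dplyr)

# Toy stand-ins for the real grocery and health files (illustrative only)
grocery_toy <- data.frame(zcta = c("01001", "01002", "01003"),
                          grocery_count = c(7, NA, 4))
health_toy  <- data.frame(zcta = c("01001", "01002", "01004"),
                          OBESITY = c(25.8, 22.1, 30.0))

# inner_join keeps only ZCTAs present in both tables; na.omit() then drops
# observations with missing values instead of imputing them
joined <- inner_join(grocery_toy, health_toy, by = "zcta") %>% na.omit()
```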
The health data had estimates for 32 health conditions, which we narrowed down to five. The grocery store data had 32 variables which we narrowed down considerably, focusing on supermarkets/grocery stores and not on specialty food or warehouse club stores. Some renaming of variables was done to clarify variable meanings.
The rest of the data come from the American Community Survey from the US Census. Some of these files had up to 700 variables, including estimates as well as margin of error fields, which we did not use. These files came with metadata files which aided greatly in narrowing down the number of variables. As we had race data for each ZCTA we decided not to include educational data broken down by race. The education data cleaning included many renames to make the variable names as concise and meaningful as possible. We used mean household income for each ZCTA. Race estimates were in seven categories, including Alaskan Native, Asian, Black and White.
The data used for the analysis of reviews and ratings of grocery stores and fast food restaurants was gathered using the Yelp API, which allows for 5,000 free daily requests. The specific endpoints and requests used to gather the data are specified in Section 3.5 API Data + NLP.
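As an illustrative sketch only (the exact requests used are given in Section 3.5), a business search call against Yelp's `/v3/businesses/search` endpoint can be assembled with `httr`; the search term, location, and `api_key` below are placeholder assumptions:

```r
library(httr)

# Placeholder parameters -- the actual terms/locations used are in Section 3.5
base  <- "https://api.yelp.com/v3/businesses/search"
query <- list(term = "grocery", location = "Detroit, MI", limit = 50)
url   <- modify_url(base, query = query)

# Sending the request requires a (free) Yelp API key passed as a Bearer token:
# resp <- GET(url, add_headers(Authorization = paste("Bearer", api_key)))
# businesses <- content(resp)$businesses
```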
All data used in Sections 3.2 through 3.4 of the report was downloaded directly from publicly accessible websites (in most cases, from the United States government). No web scraping was performed; all data collected was in the form of Excel or CSV files. The data itself was reported at the zip code level, so considerations regarding individuals’ informed consent and privacy did not arise. Furthermore, many academic studies have previously explored food deserts and the factors that impact or relate to their locations, and the results of these studies are generally available to the public.
For Section 3.5, we used Yelp API data. Reviews and ratings were collected at the business level, with each business tracked by a business_id. While the business ID can be used to track down specific store locations, this information is publicly accessible on Yelp, both through the developer portal (free to the public) and the Yelp website itself (also free to the public). We additionally collected reviews from individual users for the businesses of interest. No user information was collected in this process, only the text of each review and the corresponding rating. Just as with businesses, individual review information is publicly available and can be looked up on Yelp for free. As a result, there were no special ethical considerations in our project.
Data cleaning steps for the non-API data are briefly discussed in 3.1.1 Demographic and General Data. Due to the sheer size of the original data files and the lengthy process of cleaning them, we have excluded the cleaning code from this final report. However, all the data cleaning R files can be found in their respective folders on our publicly accessible GitHub page linked HERE. The remaining pre-processing steps are done at the beginning of each section, as different steps were needed for different types of analyses. API data pre-processing is shown in Section 3.5 API Data & Sentiment Analysis.
grocery <- read.csv('grocery.csv')
health <- read.csv('health.csv')
race <- read.csv('race.csv')
income <- read.csv('income_cleaned.csv')
edu <- read.csv('education.csv')
According to the 2017 Department of Agriculture report, thousands of Americans live in food deserts in the United States, with low access to healthy and affordable food; low income is one of the most important variables in determining where these areas are located (USDA, 2017). With this in mind, we will try to understand the relationships between the different demographic and economic variables that shape food deserts in the US.
Before we dive into more detailed and complex machine learning algorithms, we will first run some basic descriptive statistics to understand how these variables relate to one another. We will also make some plots to get a better idea of the variables’ distributions.
First, we plot the total population distribution of all zip codes available in our data set, followed by the population distributions for white people, black people, and people of mixed race; all other racial minorities are classified as “other” for the histogram.
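The plotting code for these histograms is suppressed in this report; a representative sketch (using simulated right-skewed populations rather than the real column) could look like:

```r
library(ggplot2)

set.seed(42)
# Simulated zip-code populations; lognormal draws mimic the right skew seen
# in the real data (most zip codes small, a few very large)
toy <- data.frame(tot_pop = rlnorm(1000, meanlog = 8, sdlog = 1))

p <- ggplot(toy, aes(x = tot_pop)) +
  geom_histogram(bins = 50, fill = "#69b3a2", color = "#e9ecef")
```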
From the plots above it is clear that the majority of zip codes in our data set have populations under 5,000, while a few extreme observations on the right side of the distribution drag the mean to the right (mean > median). This translates into a positively (right-) skewed distribution, a pattern that carries over to all sub-populations.
Next, we will plot the distributions of our health variables, which are of great importance for understanding food deserts in the US. The health variables we are most interested in are Obesity, Depression, Cancer, and Cholesterol screening.
From the plots we can see that Obesity and Depression follow what looks like an almost normal distribution, while Cancer and Cholesterol screening have some extreme values in their tails; the skewness statistics below show that both are in fact left-skewed, Cancer only slightly and Cholesterol screening markedly so.
To complement the graphs above, we also present some descriptive statistics:
| mean | sd | median | min | max | skew | kurtosis | |
|---|---|---|---|---|---|---|---|
| OBESITY | 34.348921 | 5.411040 | 35.0 | 12.9 | 56.3 | -0.3445694 | 0.5392805 |
| DEPRESSION | 21.612375 | 3.484605 | 21.7 | 9.9 | 35.9 | 0.0832339 | -0.0433909 |
| CANCER | 7.461743 | 1.493440 | 7.6 | 0.7 | 20.6 | -0.1776041 | 2.2571617 |
| CHOLSCREEN | 89.339782 | 2.810300 | 89.6 | 64.4 | 98.5 | -1.3043510 | 5.2035176 |
Based on the results from the USDA we know that income is an important variable for understanding food deserts in the USA. For this reason we repeat the same process, but now graph the distributions by income (Low, High). The cutoff used to classify a zip code as low or high income for these graphs is the median of the mean income across zip codes.
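The Low/High classification can be sketched as a simple median split (toy incomes below stand in for the real mean income column):

```r
# Median split: zip codes with mean income below the overall median are
# classified "low", the rest "high" (toy values for illustration)
toy <- data.frame(mean_income = c(40000, 55000, 70000, 90000))
toy$income <- ifelse(toy$mean_income < median(toy$mean_income), "low", "high")
```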
ggplot(data, aes(x=obesity, fill = as.factor(income)))+
geom_histogram( color='#e9ecef', alpha=0.6, position='identity', stat = "count")+
geom_vline(aes(xintercept=mean(OBESITY)), color="blue", linetype="dashed", size=1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
ggplot(data, aes(x=depression, fill = as.factor(income)))+
geom_histogram( color='#e9ecef', alpha=0.6, position='identity', stat = "count", binwidth = 1)+
geom_vline(aes(xintercept=mean(DEPRESSION)), color="blue", linetype="dashed", size=1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
As the output above shows, there seems to be a difference in the distribution of obesity across zip codes based on income (HIGH, LOW), and the same pattern holds for depression. We also ran the same code for cancer and cholesterol screening; there does not appear to be a difference for these two variables, though more formal testing would be appropriate.
Again bearing in mind that income is an important variable for food deserts and for health problems like obesity, we will plot the number of zip codes that have a given number of grocery stores.
par(mfrow = c(1,1))
ggplot(data, aes(x=grocery_count, fill = as.factor(income)))+
geom_histogram( color='#e9ecef', alpha=0.6, position='identity', stat = "count", binwidth = 0.5)+
xlim(0, 35) + ylim(0, 3500)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 244 rows containing non-finite values (stat_count).
## Warning: Removed 4 rows containing missing values (geom_bar).
From the plot above, there appears to be a relationship between income and the number of grocery stores.
To test this we create a table of grocery count by income, along with a table of grocery stores per square mile by income.
table_one <- tableby(income ~ grocery_count, data = data)
kable_styling(kable(summary(table_one, title = "Grocery Count by Income")), position = "center")
| high (N=14202) | low (N=14203) | Total (N=28405) | p value | |
|---|---|---|---|---|
| grocery_count | < 0.001 | |||
| Mean (SD) | 5.211 (8.236) | 3.555 (7.378) | 4.383 (7.862) | |
| Range | 0.000 - 146.000 | 0.000 - 123.000 | 0.000 - 146.000 |
table_two <- tableby(income ~ grocery_persqmile, data = data)
kable_styling(kable(summary(table_two, title = "Number of Groceries by Income")), position = "center")
| high (N=14202) | low (N=14203) | Total (N=28405) | p value | |
|---|---|---|---|---|
| grocery_persqmile | < 0.001 | |||
| Mean (SD) | 1.086 (5.992) | 0.541 (3.719) | 0.814 (4.994) | |
| Range | 0.000 - 154.736 | 0.000 - 117.796 | 0.000 - 154.736 |
We will also replicate the same methodology for grocery stores per square mile and obesity, this time taking into account whether a zip code’s population is more than 50% black.
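The majority indicator used below can be sketched as a threshold on the proportion of black residents (assuming a `black_prop` column, with toy values here):

```r
# Zip codes whose population is more than 50% black are flagged black_neigh,
# all others non_black_neigh (toy proportions for illustration)
toy <- data.frame(black_prop = c(0.62, 0.10, 0.48))
toy$majority <- ifelse(toy$black_prop > 0.5,
                       "black_neigh", "non_black_neigh")
```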
ggplot(data, aes(x=obesity, fill = as.factor(majority)))+
geom_histogram( color='#e9ecef', alpha=0.6, position='identity', stat = "count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
ggplot(data, aes(x=depression, fill = as.factor(majority)))+
geom_histogram( color='#e9ecef', alpha=0.6, position='identity', stat = "count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
table_obe <- tableby(majority ~ OBESITY, data = data)
kable_styling(kable(summary(table_obe, title = "Obesity by Mostly Black Populated Zip Codes")), position = "center")
| black_neigh (N=1136) | non_black_neigh (N=27269) | Total (N=28405) | p value | |
|---|---|---|---|---|
| OBESITY | < 0.001 | |||
| Mean (SD) | 42.217 (5.077) | 34.021 (5.171) | 34.349 (5.411) | |
| Range | 27.600 - 56.300 | 12.900 - 54.200 | 12.900 - 56.300 |
table_dep <- tableby(majority ~ DEPRESSION, data = data)
kable_styling(kable(summary(table_dep, title = "Depression by Mostly Black Populated Zip Codes")), position = "center")
| black_neigh (N=1136) | non_black_neigh (N=27269) | Total (N=28405) | p value | |
|---|---|---|---|---|
| DEPRESSION | < 0.001 | |||
| Mean (SD) | 20.104 (2.648) | 21.675 (3.501) | 21.612 (3.485) | |
| Range | 10.800 - 31.200 | 9.900 - 35.900 | 9.900 - 35.900 |
As the tables above show, both the number of grocery stores per square mile and the store count per zip code differ by income, and obesity rates are markedly higher in majority-black zip codes. To complement this, we show descriptive statistics for all grocery store variables: grocery_count, grocery_persqmile, and grocery_per1k.
| mean | sd | median | min | max | skew | kurtosis | |
|---|---|---|---|---|---|---|---|
| grocery_count | 4.38 | 7.86 | 1.00 | 0 | 146.00 | 4.93 | 43.54 |
| grocery_persqmile | 0.81 | 4.99 | 0.02 | 0 | 154.74 | 14.67 | 264.30 |
| grocery_per1k | 0.43 | 0.78 | 0.28 | 0 | 27.52 | 10.28 | 212.44 |
Next, we will check the distributions of all our educational variables. To get a better grasp of how they are distributed, we again take income into account.
Next, we will create a correlogram between our health variables, our grocery variables, the percentage of black population in each zip code, and each zip code’s mean income.
integrals <- data %>% dplyr::select( OBESITY, DEPRESSION, CANCER, CHOLSCREEN, CHD, mean_income, grocery_count, grocery_per1k, grocery_persqmile, black_prop)
# we make a correlation matrix
par(mfrow = c(1,1))
covariation <- cov(integrals)
correlation <- cov2cor(covariation)
corrplot::corrplot(correlation,
is.corr = FALSE,
method = "color",
type = "upper",
addCoef.col = "black")
From the output we can see that CHD is correlated with the other health variables, and Obesity is positively correlated with Depression. Income shows some negative correlation with several of the disease variables, such as obesity and depression, and some positive correlation with cholesterol screening. Store count and stores per square mile both appear correlated with income.
Interestingly, grocery stores per square mile and grocery count are positively correlated with the percentage of black population in a zip code, and obesity is as well. This hints that the issue may lie in the quality of the food available in zip codes with a higher black population rather than the sheer number of stores. Clearly, further tests are needed.
Creation of additional variables for subsequent analysis:
#creation of a new percentage of white variable and data frame
#containing percent and pop
race["per_white"] <- race[3]/race[2]
racew <- race[c(1,2,10)]
#creation of new variable looking at the percentage of people over the age of 25 who did not obtain a High School diploma
education <- edu[c(1,6,7)]
education['gt25_lt_hs'] <- education[2]+education[3]
educa <- education[c(1,4)]
Putting all the data together in one data frame:
#putting all tables together
df_list <- list(educa,groc,health,inc,racew,zip_county)
test <- df_list %>% reduce(inner_join , by="zcta")
head(test)
## zcta gt25_lt_hs year population aland10 grocery_count grocery_per1k
## 1 1001 7.7 2017 17537 11.5046400 7 0.3991561
## 2 1002 3.3 2017 30280 55.0646500 4 0.1321004
## 3 1003 0.0 2017 11131 0.7113496 0 0.0000000
## 4 1005 4.2 2017 5014 44.2621300 3 0.5983247
## 5 1007 3.4 2017 14906 52.6010300 3 0.2012612
## 6 1008 11.6 2017 1272 53.8019500 0 0.0000000
## grocery_persqmile Year Geolocation TotalPopulation
## 1 0.6084500 2019 POINT (-72.62581515 42.06255509) 16769
## 2 0.0726419 2019 POINT (-72.4509085 42.38758794) 29049
## 3 0.0000000 2019 POINT (-72.52475918 42.39190728) 10372
## 4 0.0677780 2019 POINT (-72.10611705 42.42009493) 5079
## 5 0.0570331 2019 POINT (-72.40031647 42.27869142) 14649
## 6 0.0000000 2019 POINT (-72.95202996 42.18429555) 1263
## OBESITY DEPRESSION CANCER CHOLSCREEN CHD mean_income tot_pop per_white state
## 1 25.8 22.1 9.1 91.8 6.9 84237.00 16064 0.9121016 MA
## 2 22.1 23.1 5.0 83.1 4.5 102003.00 30099 0.7423170 MA
## 3 20.6 29.6 0.7 65.2 1.2 91592.93 11588 0.6885571 MA
## 4 31.1 21.6 7.7 91.2 5.7 101548.00 5166 0.9440573 MA
## 5 24.4 21.9 7.1 92.0 4.8 110535.00 15080 0.9187666 MA
## 6 27.0 23.1 7.9 91.9 6.1 96822.00 1116 0.9802867 MA
## county
## 1 Hampden County
## 2 Hampshire County
## 3 Hampshire County
## 4 Worcester County
## 5 Hampshire County
## 6 Hampden County
Making a new data frame with only the relevant information and removing duplicate variables:
#retaining only important info
dat <- test[-c(1,3,4,6,7,9:11)]
#creation of a new variable for pop density
dat["density pop/sqMile"] <- dat[10]/dat[2]
#remove area
dat <- dat[-2]
The first step in our analysis will be to aggregate the mean of each variable for each state. We lose a lot of information from the individual observations, but this gives us a general look at the states and lets us see whether interesting clusters or patterns emerge.
# Removing county name and summing the mean for each state with the weight being the total population of each data point
dat_state <- dat[-12]
#aggregating the values with population as weights
dat_table <- data.table(dat_state)
mean_state <- data.frame(dat_table[, lapply(.SD, weighted.mean, w=.SD$tot_pop), by=state])
#removing the population variable
mean_state <- mean_state[,-10]
#Changing the row name to the state
rownames(mean_state) <- mean_state[,1]
mean_state <- mean_state[,-1]
head(mean_state)
## gt25_lt_hs grocery_persqmile OBESITY DEPRESSION CANCER CHOLSCREEN
## MA 9.075590 2.8535828 25.88335 20.25257 7.106639 90.19558
## RI 11.092767 2.4338993 30.03943 22.31468 7.261369 90.63312
## NH 6.715812 0.4228367 30.60422 21.15415 7.302346 90.44387
## ME 6.801984 0.2720866 31.49979 23.91703 8.100160 89.20433
## VT 6.519581 0.3862522 27.35242 22.36489 7.312865 88.73219
## CT 9.269677 2.0243417 28.72440 16.82257 7.176492 91.56941
## CHD mean_income per_white density.pop.sqMile
## MA 5.446455 116805.92 0.7654700 5324.7683
## RI 5.945513 92013.30 0.7898193 4319.3442
## NH 5.363952 102495.68 0.9196405 902.2109
## ME 6.914559 78680.32 0.9366666 502.8992
## VT 5.813694 84188.46 0.9361448 505.4003
## CT 5.077108 117166.49 0.7417718 2579.3221
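The data.table idiom used above applies `weighted.mean` to every column within each group, with that group’s population as the weights; a toy demo of the same pattern:

```r
library(data.table)

# Two zip codes in state A, one in state B; OBESITY is averaged within each
# state, weighted by population
dt <- data.table(state   = c("A", "A", "B"),
                 OBESITY = c(20, 40, 30),
                 tot_pop = c(100, 300, 50))
res <- data.frame(dt[, lapply(.SD, weighted.mean, w = .SD$tot_pop), by = state])
# State A: (20*100 + 40*300) / 400 = 35
```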
#centering and scaling
z_mean_state<-scale(mean_state,scale=TRUE,center = T)
Once our data is scaled and centered, we can start looking for patterns. We will attempt to find clusters among the states using k-means and hierarchical clustering algorithms.
The first step in this process is to find the optimal number of clusters to use. This hyperparameter can be found by iterating over the data for different numbers of clusters; here we consider between 1 and 20 clusters, with 25 random starts (nstart = 25) per cluster number. We then look at how the within sum of squares (WSS) decreases with each additional cluster.
#storing the wss in a list for each cluster
wss_list = list()
#looping over the data frame 20 times for each cluster number
for (i in 1:20) {
k_means = kmeans(z_mean_state, centers=i, nstart=25)
wss_list[[i]]=tibble(k=i, ss=k_means$tot.withinss)
}
#store the obtained wss in a list
wss_list = bind_rows(wss_list)
We then plot the WSS against the number of clusters to choose the optimal number, looking for an elbow pattern where adding another cluster no longer decreases the WSS enough to warrant it.
#plotting wss v number of clusters
ggplot(wss_list, aes(x = k, y = ss))+ geom_line()+ geom_point()+ xlab("Number of Clusters")+ ylab("Within groups sum of squares (WSS)")
The plot does not show a clean elbow from which to read off a specific cluster number, but we choose k = 4, as this reduces the WSS sufficiently while keeping the number of clusters low.
Now that we have our hyperparameter, we rerun the algorithm with k = 4 and 50 different k-means starting points.
#rerunning the kmeans algo with 50 different starts
k_means_final = kmeans(z_mean_state, 4, nstart = 50)
print(k_means_final)
## K-means clustering with 4 clusters of sizes 19, 14, 2, 15
##
## Cluster means:
## gt25_lt_hs grocery_persqmile OBESITY DEPRESSION CANCER CHOLSCREEN
## 1 -0.8064823 -0.20787759 -0.1519727 0.06723619 0.4666288 -0.3880485
## 2 0.6203544 -0.31249630 1.0854079 0.92214826 0.4142257 0.1224373
## 3 0.2093941 4.24701204 -1.1772018 -0.62991390 -1.2971288 1.2366967
## 4 0.4146276 -0.01129344 -0.6635885 -0.86184903 -0.8047233 0.2123603
## CHD mean_income per_white density.pop.sqMile
## 1 -0.1949918 -0.06000974 0.73199784 -0.2813811
## 2 1.1417737 -0.90566241 0.01319079 -0.4101791
## 3 -1.0608537 2.02970335 -1.65268371 4.4147893
## 4 -0.6772188 0.65067014 -0.71915084 0.1506114
##
## Clustering vector:
## MA RI NH ME VT CT NY PA DE DC VA MD WV NC SC GA FL AL TN MS KY OH IN MI IA WI
## 4 1 1 1 1 4 3 1 1 3 4 4 2 2 2 4 4 2 2 2 2 2 2 2 1 1
## MN SD ND MT IL MO KS NE LA AR OK TX CO WY ID UT AZ NM NV CA HI OR WA AK
## 1 1 1 1 4 2 1 1 2 2 2 4 4 1 1 1 4 4 4 4 4 1 1 4
##
## Within cluster sum of squares by cluster:
## [1] 80.57517 38.68255 17.60945 96.96733
## (between_SS / total_SS = 52.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Here we can see the cluster means for each variable in our data frame, and which states belong to each cluster. For a more visual understanding, we plot this solution on a two-dimensional plane.
Plotting the clusters on a 2-dimensional plane:
#Plotting the solution using fviz
fviz_cluster(k_means_final, z_mean_state, ellipse.type = "norm")
## Too few points to calculate an ellipse
Our first immediate observation is the small cluster composed only of DC and NY, which stands far apart from the rest. This cluster has a very high mean number of grocery stores per square mile (4.25 standard deviations above average) while also having the lowest means for obesity and cancer; it also has the highest mean income and, unsurprisingly, the highest mean population density. The other three clusters do not separate the states as nicely. The cluster containing WV, MS, AL, and other largely southern states has the highest obesity mean while also having the fewest grocery stores per square mile; additionally, it has the highest rates of depression and coronary heart disease (CHD) along with the lowest population density and mean income of all the clusters.
Our second method is hierarchical clustering, which we use to see if we can find more distinct clusters. Once again we look at the WSS to determine the optimal number of clusters:
#plotting the hclust optimal clusters
fviz_nbclust(z_mean_state, FUN = hcut, method = "wss")
We will use the same number of clusters as in our previous clustering algorithm, k = 4, which allows us to compare the two. With hierarchical clustering we can easily plot the dendrogram for a nice visualization of the clusters.
#Hierarchical clustering using the hcut algorithm
hierch_clust = hcut(z_mean_state, k=4, hc_method = "complete")
#plotting the dendrogram
plot(hierch_clust, cex = 0.55,hang = -1)
rect.hclust(hierch_clust, k = 4,border = 3:6)
We see that the 4-cluster solution differs considerably from the k-means one: we get one giant cluster and three smaller ones, with NY forming its own cluster. Visualizing this solution on a 2D plane we get the following results:
#visualisation of the solution
fviz_cluster(hierch_clust)
Using hierarchical clustering we get much more defined clusters than with k-means. We once again observe that NY lies very far from the rest of the data.
We also try to use principal component analysis (PCA) to reduce the dimensionality of our data.
Looking at the percentage of variation explained by each principal component:
# creating the pca object
pca = prcomp(z_mean_state)
# plotting variance explained by each principal component
fviz_eig(pca)
Plotting the solution with the first two principal components on a biplot :
#Biplot of first two dimensions
fviz_pca_biplot(pca)
We see that population density and grocery stores per square mile are very close together and point in the same direction. We also observe that most of the health-related variables point opposite to the mean income variable, which suggests income is an important factor in understanding health problems in certain states. Grocery store access also seems to play a role in health measures, as the two load in different directions on the first principal component. As this analysis was done at the state level, we lose a lot of information on individual food deserts within states, especially since food deserts often have lower populations and are thus drowned out by large cities with much better access to grocery stores.
This is why we now focus on the counties in a single state.
To get a better understanding of how food deserts work in specific states, we first sorted the data by grocery store per square mile and by descending population.
#Sorting the date by grocery and population
ordered <- arrange(dat,grocery_persqmile,desc(dat$tot_pop))
#looking at the number of times each state appears in the data set
nState <- ordered %>% dplyr::count(ordered$state,sort = TRUE)
names(nState)[1] <- "state"
Next, we took the first 10,000 ordered data points to see where most of the food deserts were located.
ordered_low <- ordered[1:10000,]
#counting the number of appearances of each state
nLow <- ordered_low %>% dplyr::count(ordered_low$state,sort = TRUE)
names(nLow)[1] <- "state"
#new variable: the number of a state's zip codes in the lowest 10,000 divided by that state's total number of zip codes
nTot <- merge(nLow,nState, by="state")
nTot["%lowFoodAccess"] <- nTot$n.x/nTot$n.y
nTot_ordered <- arrange(nTot,desc(nTot$'%lowFoodAccess'))
nTot_ordered
## state n.x n.y %lowFoodAccess
## 1 WY 89 107 0.83177570
## 2 AK 163 201 0.81094527
## 3 MT 221 279 0.79211470
## 4 ND 226 321 0.70404984
## 5 NM 147 227 0.64757709
## 6 SD 201 311 0.64630225
## 7 WV 279 498 0.56024096
## 8 NE 287 518 0.55405405
## 9 IA 481 885 0.54350282
## 10 KS 333 625 0.53280000
## 11 UT 123 247 0.49797571
## 12 OK 295 608 0.48519737
## 13 VT 110 230 0.47826087
## 14 ID 107 224 0.47767857
## 15 CO 197 434 0.45391705
## 16 MO 410 904 0.45353982
## 17 IL 593 1308 0.45336391
## 18 MN 374 851 0.43948296
## 19 AR 214 495 0.43232323
## 20 ME 159 377 0.42175066
## 21 PA 629 1555 0.40450161
## 22 WI 285 748 0.38101604
## 23 AZ 137 364 0.37637363
## 24 IN 264 711 0.37130802
## 25 NV 49 141 0.34751773
## 26 KY 210 618 0.33980583
## 27 NH 77 229 0.33624454
## 28 OH 342 1107 0.30894309
## 29 NY 459 1555 0.29517685
## 30 HI 24 83 0.28915663
## 31 OR 99 357 0.27731092
## 32 TX 465 1704 0.27288732
## 33 MD 110 405 0.27160494
## 34 WA 135 512 0.26367188
## 35 VA 196 749 0.26168224
## 36 MS 96 380 0.25263158
## 37 SC 93 382 0.24345550
## 38 MI 221 924 0.23917749
## 39 LA 110 460 0.23913043
## 40 AL 131 571 0.22942207
## 41 TN 125 580 0.21551724
## 42 NC 158 744 0.21236559
## 43 DE 12 57 0.21052632
## 44 GA 136 677 0.20088626
## 45 MA 99 501 0.19760479
## 46 CT 40 261 0.15325670
## 47 RI 10 68 0.14705882
## 48 CA 223 1544 0.14443005
## 49 DC 2 24 0.08333333
## 50 FL 54 745 0.07248322
We can observe that Wyoming, Alaska, and Montana have a very high percentage of their zip codes in the lowest 10,000. We therefore decide to focus our research on WY, to see whether we can differentiate between the counties of Wyoming.
#Selecting the datapoints belonging in Wyoming
desert_WY <- subset(ordered, ordered$state=="WY")
desert_WY <- desert_WY[-11]
#Aggregating the mean variables for each county with the weight being the population
my_dt <- data.table(desert_WY)
df <- data.frame(my_dt[, lapply(.SD, weighted.mean, w=.SD$tot_pop), by=county])
df <- df[,-10]
#changing the row names to the county names and removing the total pop variable
rownames(df) <- df[,1]
df <- df[,-1]
head(df)
## gt25_lt_hs grocery_persqmile OBESITY DEPRESSION CANCER
## Lincoln County 7.708254 0.001010544 27.66163 18.72127 7.284406
## Natrona County 6.198128 0.037049590 30.88646 20.82228 7.030077
## Big Horn County 10.281215 0.001591856 31.67807 18.65292 8.483569
## Uinta County 7.125098 0.004224780 30.41911 19.97304 6.116938
## Campbell County 6.736509 0.001915071 32.11242 18.80502 5.210784
## Fremont County 8.179114 0.003042984 27.66606 18.37341 7.471202
## CHOLSCREEN CHD mean_income per_white density.pop.sqMile
## Lincoln County 89.23724 5.892397 83815.05 0.9471452 49.519484
## Natrona County 87.78411 5.727197 81182.01 0.9365881 137.793602
## Big Horn County 89.78467 7.500703 64738.93 0.9051550 8.484384
## Uinta County 87.09805 5.529374 75088.72 0.9220070 15.506028
## Campbell County 86.10895 4.274527 87915.39 0.9281575 14.340235
## Fremont County 88.63562 6.548194 70167.44 0.7219323 13.582238
#Scaling our data
#centering and scaling
z_mean_county<-scale(df,scale=TRUE)
Now that we have the county aggregates, we can continue our unsupervised analysis.
We will perform a PCA and then cluster the Wyoming counties on the first two principal components to better understand the relationships between the variables.
#PCA
pca_WY = prcomp(z_mean_county)
# Eigenvalues to check how much of the variation is explained by each principal component
fviz_eig(pca_WY)
We see that the first three principal components can explain around 75% of the variation in our data.
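The proportions plotted by fviz_eig come straight from the squared standard deviations stored in the prcomp object; a self-contained sketch on a stand-in matrix (the real z_mean_county is not reproduced in this report):

```r
set.seed(1)
# Stand-in scaled matrix in place of z_mean_county
X <- scale(matrix(rnorm(20 * 10), nrow = 20))
pca_demo <- prcomp(X)

# Proportion of variance explained per component, and its cumulative sum
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
cum_var <- cumsum(var_explained)
```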
Plotting the first two principal components on a biplot we get this solution :
#Biplot of dimensions
fviz_pca_biplot(pca_WY,label = "var")
The results for Wyoming seem much more clear-cut than for the states overall. The first principal component seems to capture the negative correlation between health problems on one side and grocery stores per square mile plus mean income on the other. Focusing on the grocery loading for Wyoming, counties with fewer grocery stores per square mile also seem to have much higher rates of coronary heart disease (CHD), cancer, and cholesterol. Depression does not seem to be correlated with any of the features in our data. We also observe that obesity appears more in counties with a higher proportion of people without a high school degree.
We then try to cluster the Wyoming counties using the first two principal components, first looking at the optimal number of clusters with k-means again:
#We first extract the first two principal components
WY_pca = pca_WY$x[, c("PC1", "PC2")]
#storing the wss in a list for each cluster
wss_list_WY = list()
#looping over the data frame 20 times for each cluster number
for (i in 1:20) {
k_means = kmeans(WY_pca, centers=i, nstart=25)
wss_list_WY[[i]]=tibble(k=i, ss=k_means$tot.withinss)
}
#store the obtained wss in a list
wss_list_WY = bind_rows(wss_list_WY)
# Plot sum of squares vs. number of clusters
ggplot(wss_list_WY, aes(x=k, y=ss)) + geom_line() +
xlab("Number of Clusters") +
ylab("Within groups sum of squares")
This time a clear elbow shape appears at k = 4, so we rerun the algorithm with 4 clusters.
# Compute again with k = 4 and visualize
WY_pca_res <- kmeans(WY_pca, 4, nstart = 50)
fviz_cluster(WY_pca_res, WY_pca, ellipse.type = "norm")
## Too few points to calculate an ellipse
The clusters allow us to observe some differentiation between counties. Teton County, which is one of the most concentrated counties in the US in terms of wealth, and Sublette County, ranked 12th and 13th respectively among the healthiest counties by U.S. News & World Report in collaboration with the Aetna Foundation, are quite separate from the rest of the counties. From the previous biplot we know that counties with lower scores on the first principal component have higher rates of health problems and fewer grocery stores. This is confirmed by the 4th cluster, which contains counties such as Big Horn and Niobrara that appear on the Food Access Research Atlas as food deserts.
Using clustering algorithms and principal component analysis, we observe that there seems to be a relationship between health, income, and grocery store availability. This result is especially pronounced in states where many data points reveal food deserts. It would be interesting to extend this type of analysis to other states to see whether these conclusions continue to hold. Next, we use supervised machine learning to better understand these relationships.
In this section, we test several supervised learning models to determine the relationship between obesity and several demographic and economic predictors. We are interested in whether obesity is related to these variables, among which is grocery store count. In other words, can grocery store count, along with some of the other demographic/economic variables, be used to predict obesity?
We first join the data together, clean up, and select desired variables.
rm(list = ls())
grocery <- read.csv('grocery.csv')
health <- read.csv('health.csv')
race <- read.csv('race.csv')
income <- read.csv('income_cleaned.csv')
edu <- read.csv('education.csv')
# data has all variables, dat will be used for analysis
data <- inner_join(grocery, health, by='zcta')
data <- inner_join(data, race, by='zcta')
data <- inner_join(data, income, by='zcta')
data <- inner_join(data, edu, by='zcta')
# found an na in grocery_per1k - imputed value with median
sum(is.na(data))
## [1] 1
which(is.na(data), arr.ind=TRUE)
## row col
## [1,] 23785 6
data$grocery_per1k[is.na(data$grocery_per1k)] <- median(data$grocery_per1k, na.rm = T)
sum(is.na(data))
## [1] 0
# remove zcta, year, Year and geolocation (lat&long) for supervised learning
dat <- data[-c(1,2,8,9)]
# initial linear regressions suggested that some of the age
# ranges are not linearly associated with obesity
dat <- dat[-c(21,22,23,24,25,26,27,28,29,30,31,32,33)]
# and mixed race seems to have problems... multicollinearity?
dat <- dat[-19]
# Removing population stats due to multicollinearity
dat <- dat[-c(1,6,12)]
We prepare for model training by splitting the data into training, validation, and test sets.
# Train/Validation/Test split
set.seed(33)
sample <- sample(c(TRUE, FALSE), nrow(dat), replace=TRUE, prob=c(0.6,0.4))
train <- dat[sample, ]
testvalid <- dat[!sample, ]
validtest <- sample(c(TRUE, FALSE), nrow(testvalid), replace=TRUE, prob=c(0.5,0.5))
valid <- testvalid[validtest, ]
test <- testvalid[!validtest, ]
# Obesity is variable 5
X_train <- train[-5]
y_train <- train[5]
X_valid <- valid[-5]
y_valid <- valid[5]
X_test <- test[-5]
y_test <- test[5]
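The two-stage sampling above should leave three disjoint sets that together cover every row; a quick sketch of that property on toy data (same seed and proportions as above):

```r
set.seed(33)
toy <- data.frame(x = rnorm(100))
sel <- sample(c(TRUE, FALSE), nrow(toy), replace = TRUE, prob = c(0.6, 0.4))
toy_train <- toy[sel, , drop = FALSE]
toy_rest  <- toy[!sel, , drop = FALSE]
sel2 <- sample(c(TRUE, FALSE), nrow(toy_rest), replace = TRUE, prob = c(0.5, 0.5))
toy_valid <- toy_rest[sel2, , drop = FALSE]
toy_test  <- toy_rest[!sel2, , drop = FALSE]
# every row lands in exactly one of the three sets
nrow(toy_train) + nrow(toy_valid) + nrow(toy_test)
```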
We generate a histogram for obesity, the response variable.
# maybe skews left?
histogram(train$OBESITY)
We also take a look at the relationship between obesity and the grocery store count per square mile.
plot(train$grocery_persqmile, train$OBESITY)
abline(lm(train$OBESITY~train$grocery_persqmile))
To select our variables, we use two different methods. We first consider backward variable selection using AIC as a metric. This procedure removes ‘white’ and ‘depression’ as predictors of obesity.
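The call producing the trace below is suppressed from the output; it is presumably of this form, sketched here with base R's step() on the built-in mtcars data rather than our training set:

```r
# backward selection by AIC on a stand-in dataset
fit_full <- lm(mpg ~ ., data = mtcars)
fit_back <- step(fit_full, direction = "backward", trace = 0)
formula(fit_back)
```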
## Start: AIC=37705.43
## OBESITY ~ aland10 + grocery_count + grocery_per1k + grocery_persqmile +
## DEPRESSION + CANCER + CHOLSCREEN + CHD + white + black +
## alaskan_native + asian + pacific_islan + other + mean_income +
## X25.34_HS + X25.34_gtbach + X35.44_HS + X35.44_gtbach + X45.64_HS +
## X45.64_gtbach + gt65_HS + gt65_gtbach
##
## Df Sum of Sq RSS AIC
## - white 1 0 154427 37703
## - DEPRESSION 1 3 154430 37704
## <none> 154427 37705
## - grocery_per1k 1 21 154448 37706
## - alaskan_native 1 93 154520 37714
## - X25.34_HS 1 136 154563 37718
## - gt65_HS 1 179 154606 37723
## - mean_income 1 208 154636 37727
## - pacific_islan 1 244 154671 37730
## - other 1 251 154679 37731
## - X45.64_HS 1 257 154685 37732
## - X35.44_HS 1 258 154685 37732
## - X25.34_gtbach 1 270 154697 37733
## - CHOLSCREEN 1 336 154763 37741
## - grocery_count 1 454 154882 37754
## - X35.44_gtbach 1 513 154940 37760
## - aland10 1 646 155073 37775
## - grocery_persqmile 1 1958 156386 37919
## - X45.64_gtbach 1 2200 156628 37946
## - black 1 4348 158776 38179
## - gt65_gtbach 1 4444 158871 38189
## - asian 1 8373 162801 38608
## - CANCER 1 23126 177554 40093
## - CHD 1 49834 204262 42492
##
## Step: AIC=37703.44
## OBESITY ~ aland10 + grocery_count + grocery_per1k + grocery_persqmile +
## DEPRESSION + CANCER + CHOLSCREEN + CHD + black + alaskan_native +
## asian + pacific_islan + other + mean_income + X25.34_HS +
## X25.34_gtbach + X35.44_HS + X35.44_gtbach + X45.64_HS + X45.64_gtbach +
## gt65_HS + gt65_gtbach
##
## Df Sum of Sq RSS AIC
## - DEPRESSION 1 3 154430 37702
## <none> 154427 37703
## - grocery_per1k 1 21 154448 37704
## - alaskan_native 1 95 154522 37712
## - X25.34_HS 1 136 154563 37716
## - gt65_HS 1 179 154606 37721
## - mean_income 1 208 154636 37725
## - pacific_islan 1 244 154672 37729
## - other 1 252 154679 37729
## - X45.64_HS 1 258 154685 37730
## - X35.44_HS 1 258 154685 37730
## - X25.34_gtbach 1 270 154698 37731
## - CHOLSCREEN 1 337 154765 37739
## - X35.44_gtbach 1 515 154942 37758
## - aland10 1 646 155073 37773
## - grocery_count 1 664 155092 37775
## - grocery_persqmile 1 2189 156616 37942
## - X45.64_gtbach 1 2219 156646 37946
## - black 1 4394 158821 38182
## - gt65_gtbach 1 4444 158871 38187
## - asian 1 8385 162812 38607
## - CANCER 1 23232 177659 40101
## - CHD 1 51328 205755 42615
##
## Step: AIC=37701.73
## OBESITY ~ aland10 + grocery_count + grocery_per1k + grocery_persqmile +
## CANCER + CHOLSCREEN + CHD + black + alaskan_native + asian +
## pacific_islan + other + mean_income + X25.34_HS + X25.34_gtbach +
## X35.44_HS + X35.44_gtbach + X45.64_HS + X45.64_gtbach + gt65_HS +
## gt65_gtbach
##
## Df Sum of Sq RSS AIC
## <none> 154430 37702
## - grocery_per1k 1 22 154452 37702
## - alaskan_native 1 96 154526 37710
## - X25.34_HS 1 136 154566 37715
## - gt65_HS 1 181 154611 37720
## - mean_income 1 212 154642 37723
## - pacific_islan 1 244 154674 37727
## - X45.64_HS 1 259 154689 37728
## - X35.44_HS 1 260 154690 37729
## - other 1 264 154694 37729
## - X25.34_gtbach 1 272 154702 37730
## - CHOLSCREEN 1 335 154765 37737
## - X35.44_gtbach 1 514 154944 37757
## - grocery_count 1 662 155092 37773
## - aland10 1 663 155093 37773
## - grocery_persqmile 1 2200 156630 37942
## - X45.64_gtbach 1 2218 156648 37944
## - gt65_gtbach 1 4450 158880 38186
## - black 1 4456 158886 38187
## - asian 1 8587 163017 38626
## - CANCER 1 23972 178402 40170
## - CHD 1 55962 210392 42994
##
## Call:
## lm(formula = OBESITY ~ aland10 + grocery_count + grocery_per1k +
## grocery_persqmile + CANCER + CHOLSCREEN + CHD + black + alaskan_native +
## asian + pacific_islan + other + mean_income + X25.34_HS +
## X25.34_gtbach + X35.44_HS + X35.44_gtbach + X45.64_HS + X45.64_gtbach +
## gt65_HS + gt65_gtbach, data = train)
##
## Coefficients:
## (Intercept) aland10 grocery_count grocery_per1k
## 2.589e+01 -7.360e-04 -4.357e-02 -4.676e-02
## grocery_persqmile CANCER CHOLSCREEN CHD
## -8.550e-02 -1.908e+00 8.452e-02 2.075e+00
## black alaskan_native asian pacific_islan
## 1.698e-04 -2.052e-04 -4.115e-04 -8.604e-04
## other mean_income X25.34_HS X25.34_gtbach
## -7.616e-05 -5.601e-06 8.833e-03 -8.670e-03
## X35.44_HS X35.44_gtbach X45.64_HS X45.64_gtbach
## 1.245e-02 -1.269e-02 1.759e-02 -4.062e-02
## gt65_HS gt65_gtbach
## 1.072e-02 -4.848e-02
We use the results from the backward variable selection to fit an Ordinary Least Squares (OLS) model.
# Using results from backward step to fit OLS model
data.step <- train[-c(6,10)]
lm.fitstep <- lm(OBESITY ~ ., data = data.step)
summary(lm.fitstep)
##
## Call:
## lm(formula = OBESITY ~ ., data = data.step)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.5853 -1.8447 0.1737 1.9512 17.7564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.589e+01 1.142e+00 22.671 < 2e-16 ***
## aland10 -7.360e-04 8.590e-05 -8.568 < 2e-16 ***
## grocery_count -4.357e-02 5.090e-03 -8.560 < 2e-16 ***
## grocery_per1k -4.676e-02 3.011e-02 -1.553 0.120446
## grocery_persqmile -8.550e-02 5.478e-03 -15.608 < 2e-16 ***
## CANCER -1.908e+00 3.703e-02 -51.521 < 2e-16 ***
## CHOLSCREEN 8.452e-02 1.388e-02 6.089 1.16e-09 ***
## CHD 2.075e+00 2.637e-02 78.719 < 2e-16 ***
## black 1.698e-04 7.643e-06 22.212 < 2e-16 ***
## alaskan_native -2.052e-04 6.311e-05 -3.252 0.001148 **
## asian -4.115e-04 1.334e-05 -30.835 < 2e-16 ***
## pacific_islan -8.604e-04 1.656e-04 -5.197 2.05e-07 ***
## other -7.616e-05 1.409e-05 -5.406 6.53e-08 ***
## mean_income -5.601e-06 1.157e-06 -4.841 1.31e-06 ***
## X25.34_HS 8.833e-03 2.273e-03 3.887 0.000102 ***
## X25.34_gtbach -8.670e-03 1.579e-03 -5.490 4.07e-08 ***
## X35.44_HS 1.245e-02 2.320e-03 5.368 8.04e-08 ***
## X35.44_gtbach -1.269e-02 1.682e-03 -7.541 4.90e-14 ***
## X45.64_HS 1.759e-02 3.283e-03 5.357 8.57e-08 ***
## X45.64_gtbach -4.062e-02 2.592e-03 -15.672 < 2e-16 ***
## gt65_HS 1.072e-02 2.397e-03 4.474 7.73e-06 ***
## gt65_gtbach -4.848e-02 2.184e-03 -22.198 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.005 on 17100 degrees of freedom
## Multiple R-squared: 0.6911, Adjusted R-squared: 0.6907
## F-statistic: 1822 on 21 and 17100 DF, p-value: < 2.2e-16
We run F-tests to determine which variables have a significant impact on obesity.
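dropterm() comes from the MASS package; a sketch of the same single-term-deletion F-test on the built-in mtcars data:

```r
library(MASS)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
# one row per candidate single-term deletion, plus the <none> baseline
dropterm(fit, test = "F")
```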
# Based on F-value
dropterm(fit.full, test = "F")
## Single term deletions
##
## Model:
## OBESITY ~ aland10 + grocery_count + grocery_per1k + grocery_persqmile +
## DEPRESSION + CANCER + CHOLSCREEN + CHD + white + black +
## alaskan_native + asian + pacific_islan + other + mean_income +
## X25.34_HS + X25.34_gtbach + X35.44_HS + X35.44_gtbach + X45.64_HS +
## X45.64_gtbach + gt65_HS + gt65_gtbach
## Df Sum of Sq RSS AIC F Value Pr(F)
## <none> 154427 37705
## aland10 1 646 155073 37775 71.5 < 2.2e-16 ***
## grocery_count 1 454 154882 37754 50.3 1.368e-12 ***
## grocery_per1k 1 21 154448 37706 2.3 0.1302711
## grocery_persqmile 1 1958 156386 37919 216.8 < 2.2e-16 ***
## DEPRESSION 1 3 154430 37704 0.3 0.5859717
## CANCER 1 23126 177554 40093 2560.5 < 2.2e-16 ***
## CHOLSCREEN 1 336 154763 37741 37.2 1.091e-09 ***
## CHD 1 49834 204262 42492 5517.6 < 2.2e-16 ***
## white 1 0 154427 37703 0.0 0.9425812
## black 1 4348 158776 38179 481.4 < 2.2e-16 ***
## alaskan_native 1 93 154520 37714 10.3 0.0013473 **
## asian 1 8373 162801 38608 927.1 < 2.2e-16 ***
## pacific_islan 1 244 154671 37730 27.0 2.053e-07 ***
## other 1 251 154679 37731 27.8 1.348e-07 ***
## mean_income 1 208 154636 37727 23.0 1.593e-06 ***
## X25.34_HS 1 136 154563 37718 15.0 0.0001064 ***
## X25.34_gtbach 1 270 154697 37733 29.9 4.713e-08 ***
## X35.44_HS 1 258 154685 37732 28.5 9.421e-08 ***
## X35.44_gtbach 1 513 154940 37760 56.8 5.112e-14 ***
## X45.64_HS 1 257 154685 37732 28.5 9.536e-08 ***
## X45.64_gtbach 1 2200 156628 37946 243.6 < 2.2e-16 ***
## gt65_HS 1 179 154606 37723 19.8 8.688e-06 ***
## gt65_gtbach 1 4444 158871 38189 492.0 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit1 <- update(fit.full, ~ . - white)
dropterm(fit1, sorted = TRUE, test = "F")
## Single term deletions
##
## Model:
## OBESITY ~ aland10 + grocery_count + grocery_per1k + grocery_persqmile +
## DEPRESSION + CANCER + CHOLSCREEN + CHD + black + alaskan_native +
## asian + pacific_islan + other + mean_income + X25.34_HS +
## X25.34_gtbach + X35.44_HS + X35.44_gtbach + X45.64_HS + X45.64_gtbach +
## gt65_HS + gt65_gtbach
## Df Sum of Sq RSS AIC F Value Pr(F)
## DEPRESSION 1 3 154430 37702 0.3 0.5891152
## <none> 154427 37703
## grocery_per1k 1 21 154448 37704 2.3 0.1297049
## alaskan_native 1 95 154522 37712 10.5 0.0012186 **
## X25.34_HS 1 136 154563 37716 15.0 0.0001064 ***
## gt65_HS 1 179 154606 37721 19.8 8.704e-06 ***
## mean_income 1 208 154636 37725 23.1 1.587e-06 ***
## pacific_islan 1 244 154672 37729 27.1 2.005e-07 ***
## other 1 252 154679 37729 27.9 1.318e-07 ***
## X45.64_HS 1 258 154685 37730 28.5 9.431e-08 ***
## X35.44_HS 1 258 154685 37730 28.5 9.312e-08 ***
## X25.34_gtbach 1 270 154698 37731 29.9 4.537e-08 ***
## CHOLSCREEN 1 337 154765 37739 37.4 1.003e-09 ***
## X35.44_gtbach 1 515 154942 37758 57.0 4.636e-14 ***
## aland10 1 646 155073 37773 71.5 < 2.2e-16 ***
## grocery_count 1 664 155092 37775 73.5 < 2.2e-16 ***
## grocery_persqmile 1 2189 156616 37942 242.3 < 2.2e-16 ***
## X45.64_gtbach 1 2219 156646 37946 245.7 < 2.2e-16 ***
## black 1 4394 158821 38182 486.5 < 2.2e-16 ***
## gt65_gtbach 1 4444 158871 38187 492.1 < 2.2e-16 ***
## asian 1 8385 162812 38607 928.4 < 2.2e-16 ***
## CANCER 1 23232 177659 40101 2572.3 < 2.2e-16 ***
## CHD 1 51328 205755 42615 5683.3 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit2 <- update(fit1, ~ . - DEPRESSION)
dropterm(fit2, sorted = TRUE, test = "F")
## Single term deletions
##
## Model:
## OBESITY ~ aland10 + grocery_count + grocery_per1k + grocery_persqmile +
## CANCER + CHOLSCREEN + CHD + black + alaskan_native + asian +
## pacific_islan + other + mean_income + X25.34_HS + X25.34_gtbach +
## X35.44_HS + X35.44_gtbach + X45.64_HS + X45.64_gtbach + gt65_HS +
## gt65_gtbach
## Df Sum of Sq RSS AIC F Value Pr(F)
## <none> 154430 37702
## grocery_per1k 1 22 154452 37702 2.4 0.1204456
## alaskan_native 1 96 154526 37710 10.6 0.0011483 **
## X25.34_HS 1 136 154566 37715 15.1 0.0001019 ***
## gt65_HS 1 181 154611 37720 20.0 7.732e-06 ***
## mean_income 1 212 154642 37723 23.4 1.306e-06 ***
## pacific_islan 1 244 154674 37727 27.0 2.050e-07 ***
## X45.64_HS 1 259 154689 37728 28.7 8.571e-08 ***
## X35.44_HS 1 260 154690 37729 28.8 8.044e-08 ***
## other 1 264 154694 37729 29.2 6.525e-08 ***
## X25.34_gtbach 1 272 154702 37730 30.1 4.067e-08 ***
## CHOLSCREEN 1 335 154765 37737 37.1 1.164e-09 ***
## X35.44_gtbach 1 514 154944 37757 56.9 4.896e-14 ***
## grocery_count 1 662 155092 37773 73.3 < 2.2e-16 ***
## aland10 1 663 155093 37773 73.4 < 2.2e-16 ***
## grocery_persqmile 1 2200 156630 37942 243.6 < 2.2e-16 ***
## X45.64_gtbach 1 2218 156648 37944 245.6 < 2.2e-16 ***
## gt65_gtbach 1 4450 158880 38186 492.7 < 2.2e-16 ***
## black 1 4456 158886 38187 493.4 < 2.2e-16 ***
## asian 1 8587 163017 38626 950.8 < 2.2e-16 ***
## CANCER 1 23972 178402 40170 2654.4 < 2.2e-16 ***
## CHD 1 55962 210392 42994 6196.7 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# grocery_per_1k not signif
fit3 <- update(fit2, ~ . - grocery_per1k)
dropterm(fit3, sorted = TRUE, test = "F")
## Single term deletions
##
## Model:
## OBESITY ~ aland10 + grocery_count + grocery_persqmile + CANCER +
## CHOLSCREEN + CHD + black + alaskan_native + asian + pacific_islan +
## other + mean_income + X25.34_HS + X25.34_gtbach + X35.44_HS +
## X35.44_gtbach + X45.64_HS + X45.64_gtbach + gt65_HS + gt65_gtbach
## Df Sum of Sq RSS AIC F Value Pr(F)
## <none> 154452 37702
## alaskan_native 1 92 154544 37710 10.2 0.001384 **
## X25.34_HS 1 138 154590 37715 15.3 9.242e-05 ***
## gt65_HS 1 181 154632 37720 20.0 7.798e-06 ***
## mean_income 1 207 154659 37723 22.9 1.731e-06 ***
## pacific_islan 1 246 154698 37727 27.2 1.829e-07 ***
## other 1 256 154707 37728 28.3 1.055e-07 ***
## X45.64_HS 1 260 154712 37729 28.8 8.245e-08 ***
## X35.44_HS 1 263 154715 37729 29.2 6.718e-08 ***
## X25.34_gtbach 1 275 154727 37731 30.4 3.495e-08 ***
## CHOLSCREEN 1 331 154783 37737 36.6 1.452e-09 ***
## X35.44_gtbach 1 517 154969 37757 57.3 3.965e-14 ***
## aland10 1 677 155129 37775 75.0 < 2.2e-16 ***
## grocery_count 1 699 155151 37777 77.4 < 2.2e-16 ***
## X45.64_gtbach 1 2229 156681 37945 246.8 < 2.2e-16 ***
## grocery_persqmile 1 2309 156761 37954 255.7 < 2.2e-16 ***
## gt65_gtbach 1 4457 158908 38187 493.4 < 2.2e-16 ***
## black 1 4520 158972 38194 500.4 < 2.2e-16 ***
## asian 1 8565 163017 38624 948.3 < 2.2e-16 ***
## CANCER 1 23954 178406 40169 2652.2 < 2.2e-16 ***
## CHD 1 56019 210471 42999 6202.5 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit3)
##
## Call:
## lm(formula = OBESITY ~ aland10 + grocery_count + grocery_persqmile +
## CANCER + CHOLSCREEN + CHD + black + alaskan_native + asian +
## pacific_islan + other + mean_income + X25.34_HS + X25.34_gtbach +
## X35.44_HS + X35.44_gtbach + X45.64_HS + X45.64_gtbach + gt65_HS +
## gt65_gtbach, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.5574 -1.8459 0.1776 1.9475 17.8060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.592e+01 1.142e+00 22.694 < 2e-16 ***
## aland10 -7.427e-04 8.579e-05 -8.658 < 2e-16 ***
## grocery_count -4.449e-02 5.056e-03 -8.798 < 2e-16 ***
## grocery_persqmile -8.671e-02 5.423e-03 -15.990 < 2e-16 ***
## CANCER -1.907e+00 3.703e-02 -51.500 < 2e-16 ***
## CHOLSCREEN 8.401e-02 1.388e-02 6.053 1.45e-09 ***
## CHD 2.073e+00 2.632e-02 78.756 < 2e-16 ***
## black 1.706e-04 7.626e-06 22.370 < 2e-16 ***
## alaskan_native -2.017e-04 6.308e-05 -3.199 0.00138 **
## asian -4.104e-04 1.333e-05 -30.794 < 2e-16 ***
## pacific_islan -8.639e-04 1.656e-04 -5.218 1.83e-07 ***
## other -7.479e-05 1.406e-05 -5.319 1.06e-07 ***
## mean_income -5.532e-06 1.156e-06 -4.784 1.73e-06 ***
## X25.34_HS 8.886e-03 2.272e-03 3.911 9.24e-05 ***
## X25.34_gtbach -8.711e-03 1.579e-03 -5.517 3.49e-08 ***
## X35.44_HS 1.253e-02 2.320e-03 5.401 6.72e-08 ***
## X35.44_gtbach -1.273e-02 1.682e-03 -7.569 3.97e-14 ***
## X45.64_HS 1.761e-02 3.283e-03 5.364 8.24e-08 ***
## X45.64_gtbach -4.071e-02 2.591e-03 -15.710 < 2e-16 ***
## gt65_HS 1.072e-02 2.397e-03 4.472 7.80e-06 ***
## gt65_gtbach -4.852e-02 2.184e-03 -22.214 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.005 on 17101 degrees of freedom
## Multiple R-squared: 0.691, Adjusted R-squared: 0.6907
## F-statistic: 1912 on 20 and 17101 DF, p-value: < 2.2e-16
We run some diagnostics to assess the adequacy of the fit.
pred = predict(fit3)
sres = stdres(fit3)
student = rstandard(fit3)
# Variance looks fairly stable
plot(pred, sres, pch=16); abline(h=0)
plot(pred, student, pch=16); abline(h=0)
# Variance increases with OBS but not wider/narrower spread
plot(train$OBESITY, sres, pch=16); abline(h=0)
# residual values increase with obesity, though variance doesn't change much
plot(sres, xlab = "Index", ylab = "Standardized residual", ylim = c(-4.0,4.0))
abline(h = -3.0, lty = 2)
abline(h = 3.0, lty = 2)
boxplot(student, pch=16)
# that's a lot of potential outliers? but maybe ok for 17k obs?
qqnorm(student, pch=16); qqline(student)
# May not be normal at the ends, middle section looks good
# Shapiro-Wilk test for normality
# H0: model errors are normally distributed
# !must have 5000 or fewer observations! we have 17k in training
#shapiro.test(student)
# Breusch-Pagan - test for Homoskedasticity
# H0: residuals are homoskedastic
bptest(fit3, studentize=F)
##
## Breusch-Pagan test
##
## data: fit3
## BP = 840.07, df = 20, p-value < 2.2e-16
# BP = 840.07, df = 20, p-value < 2.2e-16
# Reject H0 in favor of H1 - residuals are heteroskedastic
boxcox(fit3, lambda=seq(-2, 2, 0.1), plotit = T)
# lambda ~ 1 => no transformation
# Multicollinearity?
multico = solve(cor(train[, -5]))
summary(multico)
## aland10 grocery_count grocery_per1k grocery_persqmile
## Min. :-0.15116 Min. :-1.76969 Min. :-0.359921 Min. :-1.373018
## 1st Qu.:-0.02881 1st Qu.:-0.26310 1st Qu.:-0.034050 1st Qu.:-0.042703
## Median : 0.01922 Median :-0.02262 Median : 0.007021 Median : 0.006537
## Mean : 0.04871 Mean :-0.07580 Mean : 0.043069 Mean : 0.041998
## 3rd Qu.: 0.03236 3rd Qu.: 0.06762 3rd Qu.: 0.064814 3rd Qu.: 0.046970
## Max. : 1.05093 Max. : 4.32272 Max. : 1.100502 Max. : 1.701386
## DEPRESSION CANCER CHOLSCREEN CHD
## Min. :-0.81683 Min. :-3.998919 Min. :-2.489697 Min. :-3.99892
## 1st Qu.:-0.02004 1st Qu.:-0.172333 1st Qu.:-0.092896 1st Qu.:-0.15872
## Median : 0.05596 Median : 0.002078 Median : 0.013644 Median : 0.06038
## Mean : 0.08867 Mean :-0.011539 Mean : 0.004604 Mean : 0.14291
## 3rd Qu.: 0.14708 3rd Qu.: 0.130012 3rd Qu.: 0.147553 3rd Qu.: 0.43957
## Max. : 1.49675 Max. : 5.975409 Max. : 2.884031 Max. : 4.93210
## white black alaskan_native asian
## Min. :-1.76969 Min. :-1.28967 Min. :-0.281180 Min. :-0.55944
## 1st Qu.:-0.10020 1st Qu.:-0.03393 1st Qu.:-0.049072 1st Qu.:-0.09819
## Median :-0.02397 Median : 0.03631 Median :-0.002827 Median :-0.02564
## Mean : 0.04264 Mean : 0.07592 Mean : 0.031859 Mean : 0.03062
## 3rd Qu.: 0.08933 3rd Qu.: 0.20468 3rd Qu.: 0.031240 3rd Qu.: 0.03869
## Max. : 2.35028 Max. : 1.79860 Max. : 1.128709 Max. : 1.66179
## pacific_islan other mean_income X25.34_HS
## Min. :-0.42184 Min. :-0.79234 Min. :-0.969854 Min. :-0.33554
## 1st Qu.:-0.03401 1st Qu.:-0.04685 1st Qu.:-0.105733 1st Qu.:-0.03421
## Median :-0.01606 Median : 0.07831 Median : 0.002078 Median :-0.01208
## Mean : 0.03351 Mean : 0.08588 Mean : 0.075640 Mean : 0.02849
## 3rd Qu.: 0.04656 3rd Qu.: 0.11643 3rd Qu.: 0.074695 3rd Qu.: 0.02515
## Max. : 1.17373 Max. : 1.78298 Max. : 2.949547 Max. : 1.27845
## X25.34_gtbach X35.44_HS X35.44_gtbach X45.64_HS
## Min. :-0.559633 Min. :-0.38228 Min. :-0.72127 Min. :-0.74926
## 1st Qu.:-0.128228 1st Qu.:-0.06835 1st Qu.:-0.10348 1st Qu.:-0.09906
## Median :-0.002162 Median :-0.01173 Median :-0.02262 Median :-0.02267
## Mean : 0.029735 Mean : 0.02499 Mean : 0.01741 Mean : 0.03375
## 3rd Qu.: 0.054688 3rd Qu.: 0.05754 3rd Qu.: 0.02362 3rd Qu.: 0.03953
## Max. : 2.089272 Max. : 1.47299 Max. : 2.43098 Max. : 2.01249
## X45.64_gtbach gt65_HS gt65_gtbach
## Min. :-1.063068 Min. :-0.907109 Min. :-1.06307
## 1st Qu.:-0.169230 1st Qu.:-0.053430 1st Qu.:-0.10132
## Median :-0.031085 Median :-0.002981 Median :-0.01432
## Mean :-0.003589 Mean : 0.041194 Mean : 0.00278
## 3rd Qu.: 0.053225 3rd Qu.: 0.077797 3rd Qu.: 0.02808
## Max. : 3.647515 Max. : 1.828711 Max. : 2.33695
# Now, no values above 10!
# Multicollinearity fixed by removing population variables
The fitted-value versus residual plot suggests that heteroskedasticity may not be a major concern. However, since the Breusch-Pagan test rejected homoskedasticity (as did the White test), we fit a weighted least squares model, which did not improve the situation as far as the Breusch-Pagan or White tests are concerned. We therefore stick with the OLS model from fit3 for ease of interpretability.
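The weighted least squares attempt mentioned above is not shown; one common recipe (a sketch on synthetic heteroskedastic data, with weights taken from a first-stage regression of the absolute OLS residuals) looks like this:

```r
set.seed(4)
n <- 500
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 2 * x)  # error variance grows with x
ols <- lm(y ~ x)
# model the error scale from the absolute OLS residuals, then reweight
aux <- lm(abs(residuals(ols)) ~ x)
w <- 1 / fitted(aux)^2
wls <- lm(y ~ x, weights = w)
coef(wls)
```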
# Using results from manual F-Test variable selection to fit OLS model
data.F <- train[-c(3,6,10)]
lm.fitF <- lm(OBESITY ~ ., data = data.F)
summary(lm.fitF)
##
## Call:
## lm(formula = OBESITY ~ ., data = data.F)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.5574 -1.8459 0.1776 1.9475 17.8060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.592e+01 1.142e+00 22.694 < 2e-16 ***
## aland10 -7.427e-04 8.579e-05 -8.658 < 2e-16 ***
## grocery_count -4.449e-02 5.056e-03 -8.798 < 2e-16 ***
## grocery_persqmile -8.671e-02 5.423e-03 -15.990 < 2e-16 ***
## CANCER -1.907e+00 3.703e-02 -51.500 < 2e-16 ***
## CHOLSCREEN 8.401e-02 1.388e-02 6.053 1.45e-09 ***
## CHD 2.073e+00 2.632e-02 78.756 < 2e-16 ***
## black 1.706e-04 7.626e-06 22.370 < 2e-16 ***
## alaskan_native -2.017e-04 6.308e-05 -3.199 0.00138 **
## asian -4.104e-04 1.333e-05 -30.794 < 2e-16 ***
## pacific_islan -8.639e-04 1.656e-04 -5.218 1.83e-07 ***
## other -7.479e-05 1.406e-05 -5.319 1.06e-07 ***
## mean_income -5.532e-06 1.156e-06 -4.784 1.73e-06 ***
## X25.34_HS 8.886e-03 2.272e-03 3.911 9.24e-05 ***
## X25.34_gtbach -8.711e-03 1.579e-03 -5.517 3.49e-08 ***
## X35.44_HS 1.253e-02 2.320e-03 5.401 6.72e-08 ***
## X35.44_gtbach -1.273e-02 1.682e-03 -7.569 3.97e-14 ***
## X45.64_HS 1.761e-02 3.283e-03 5.364 8.24e-08 ***
## X45.64_gtbach -4.071e-02 2.591e-03 -15.710 < 2e-16 ***
## gt65_HS 1.072e-02 2.397e-03 4.472 7.80e-06 ***
## gt65_gtbach -4.852e-02 2.184e-03 -22.214 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.005 on 17101 degrees of freedom
## Multiple R-squared: 0.691, Adjusted R-squared: 0.6907
## F-statistic: 1912 on 20 and 17101 DF, p-value: < 2.2e-16
# XGBoost is a boosting ensemble method; unlike the random sampling in Random Forest,
# trees are built sequentially: tree N+1 is fit to the loss left by the first N trees
# https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html
# XGB Params: https://xgboost.readthedocs.io/en/latest/parameter.html?highlight=objective
xgbtrain = xgb.DMatrix(as.matrix(sapply(X_train, as.numeric)), label=y_train$OBESITY)
xgbvalid = xgb.DMatrix(as.matrix(sapply(X_valid, as.numeric)), label=y_valid$OBESITY)
watchlist <- list(train=xgbtrain, test=xgbvalid)
# max.depth = 4 and nrounds=10 yielded minimum test-rmse
xgb.fit <- xgb.train(data=xgbtrain, max.depth = 4, eta = 1, nthread = 2, nrounds = 10,
watchlist=watchlist, objective = "reg:squarederror")
## [1] train-rmse:3.581894 test-rmse:3.708840
## [2] train-rmse:3.351098 test-rmse:3.486633
## [3] train-rmse:3.206715 test-rmse:3.355757
## [4] train-rmse:3.094945 test-rmse:3.278782
## [5] train-rmse:3.008885 test-rmse:3.212923
## [6] train-rmse:2.931062 test-rmse:3.142064
## [7] train-rmse:2.882455 test-rmse:3.100884
## [8] train-rmse:2.855202 test-rmse:3.082287
## [9] train-rmse:2.829697 test-rmse:3.063717
## [10] train-rmse:2.802250 test-rmse:3.047965
# linear boosting - not as good
lin.fit <- xgb.train(data=xgbtrain, booster = "gblinear", nthread = 2, nrounds=10,
watchlist=watchlist, objective = "reg:squarederror")
## [1] train-rmse:7.325503 test-rmse:7.566465
## [2] train-rmse:5.517011 test-rmse:5.600549
## [3] train-rmse:4.827558 test-rmse:4.883340
## [4] train-rmse:4.486822 test-rmse:4.528419
## [5] train-rmse:4.301206 test-rmse:4.327606
## [6] train-rmse:4.195182 test-rmse:4.213690
## [7] train-rmse:4.123113 test-rmse:4.135740
## [8] train-rmse:4.070705 test-rmse:4.080579
## [9] train-rmse:4.028052 test-rmse:4.036772
## [10] train-rmse:3.989868 test-rmse:3.999126
summary(xgb.fit)
## Length Class Mode
## handle 1 xgb.Booster.handle externalptr
## raw 20936 -none- raw
## niter 1 -none- numeric
## evaluation_log 3 data.table list
## call 8 -none- call
## params 5 -none- list
## callbacks 2 -none- list
## feature_names 23 -none- character
## nfeatures 1 -none- numeric
# Plot of variable importance
# Gain - is the improvement in accuracy brought by a feature to the branches it is on
# Cover - measures the relative quantity of observations concerned by a feature
# Frequency - is a simpler way to measure the Gain. It just counts the number of
# times a feature is used in all generated trees
# !You should not use it (unless you know why you want to use it)!
importance_matrix <- xgb.importance(model = xgb.fit)
print(importance_matrix)
## Feature Gain Cover Frequency
## 1: X45.64_gtbach 0.4430384717 0.0675804127 0.061224490
## 2: mean_income 0.1997192237 0.1498730422 0.108843537
## 3: CHD 0.0948227668 0.1719267660 0.136054422
## 4: black 0.0444389604 0.1138619861 0.081632653
## 5: CANCER 0.0424096513 0.1094290474 0.095238095
## 6: asian 0.0386745134 0.0782305625 0.054421769
## 7: DEPRESSION 0.0330242819 0.0519118143 0.102040816
## 8: grocery_persqmile 0.0300632592 0.0131440428 0.047619048
## 9: gt65_gtbach 0.0272524100 0.0133411596 0.040816327
## 10: CHOLSCREEN 0.0197877196 0.0578223992 0.054421769
## 11: aland10 0.0086544081 0.0475723236 0.081632653
## 12: white 0.0079322873 0.0061821681 0.047619048
## 13: other 0.0058725939 0.0584079092 0.040816327
## 14: X45.64_HS 0.0019142159 0.0324760941 0.020408163
## 15: gt65_HS 0.0012862149 0.0144201918 0.013605442
## 16: X35.44_gtbach 0.0006004792 0.0133966443 0.006802721
## 17: grocery_count 0.0005085427 0.0004234362 0.006802721
xgb.plot.importance(importance_matrix = importance_matrix)
XGBoost’s gain shows that the percent of the population aged 45 to 64 with a bachelor’s degree or greater, mean income, percent of the population with coronary heart disease, percent of the population which is black, and percent of the population with cancer are the five most important variables in determining the percent of the population with obesity. Number of grocery stores per square mile is the eighth most important variable.
# View model trees
# with depth=5 the tree ends up having 608 nodes/leaves
# Commented out since output isn't helpful - >600 lines describing the tree
#xgb.dump(xgb.fit, with_stats = TRUE)
# For example tree with depth = 2... plotting a 5 level tree is not helpful
# With this many variables, it's still not visible
xgb.fit2 <- xgb.train(data=xgbtrain, max.depth = 4, eta = 1, nthread = 2, nrounds = 10,
watchlist=watchlist, objective = "reg:squarederror")
## [1] train-rmse:3.581894 test-rmse:3.708840
## [2] train-rmse:3.351098 test-rmse:3.486633
## [3] train-rmse:3.206715 test-rmse:3.355757
## [4] train-rmse:3.094945 test-rmse:3.278782
## [5] train-rmse:3.008885 test-rmse:3.212923
## [6] train-rmse:2.931062 test-rmse:3.142064
## [7] train-rmse:2.882455 test-rmse:3.100884
## [8] train-rmse:2.855202 test-rmse:3.082287
## [9] train-rmse:2.829697 test-rmse:3.063717
## [10] train-rmse:2.802250 test-rmse:3.047965
xgb.plot.tree(model = xgb.fit2, trees=0) # just from 0th root node
# Compare RMSE of linear and XGB Models
xgbtest = xgb.DMatrix(as.matrix(sapply(X_test, as.numeric)), label=y_test$OBESITY)
linpred <- predict.lm(fit3, test)
xgbpred <- predict(xgb.fit, xgbtest)
# RMSE of linear model on test set
sqrt(mean((test$OBESITY - linpred)^2))
## [1] 2.986013
# RMSE of XGBoost model on test set
sqrt(mean((test$OBESITY - xgbpred)^2))
## [1] 3.034223
When controlling for variables such as income, race, education, and various illnesses, we found a negative association between the number of grocery stores per square mile in a zip code area and the percent of inhabitants who are obese. For every additional grocery store per square mile, obesity decreases by about 0.087 percentage points. We found some other interesting associations with obesity: as the percentage of inhabitants who are black increases, so does obesity, while obesity decreases with income and education.
Using XGBoost to look at non-linear associations with obesity, we found that education (percent of inhabitants between 45 and 64 years of age with a bachelor’s degree or greater), mean income, and the percent of inhabitants who are black are important variables in determining obesity. Number of grocery stores per square mile, to a lesser degree, is also important in explaining obesity.
With an RMSE of 2.986, our linear model (fit3) outperformed the XGBoost model, which had an RMSE of 3.034, on the test data.
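The test-set comparison above boils down to a small RMSE helper; a minimal sketch:

```r
# root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(c(1, 2, 3), c(1, 2, 5))
```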
In this section, we look at Yelp reviews for grocery stores and fast food restaurants in different areas of the United States. We compare data across locations in the US which are considered food deserts and locations with abundant access to a variety of grocery stores and healthy options. The goal of this section is to explore the following:
By exploring these questions, we hope to gain insight into how customer attitudes and opinions vary between different types of stores and more importantly, between areas which are considered food deserts vs. not.
In order to correctly specify which areas of the United States to look at Yelp reviews for, we first have to come up with a list of areas which are considered food deserts and a list of areas which are not. We begin this process by loading in data from the Food Access Research Atlas generated by the United States Department of Agriculture (USDA). This data set was downloaded from the USDA website.
# read in food desert data
food_des_data <- read.csv('food_access_data.csv')
Next, we find areas of the US that are food deserts by only keeping rows where the LILATracts_1And20 and LILATracts_Vehicle flags are both 1. These two variables have, roughly, the following meaning, as defined by the USDA:
LILATracts_1And20: a low-income tract in which a significant number or share of residents live more than 1 mile (urban) or 20 miles (rural) from the nearest supermarket.
LILATracts_Vehicle: a low-income tract in which a significant number of households are located far from a supermarket and do not have access to a vehicle.
We keep track of these low-income, low-access areas by keeping their CensusTract, a unique identifier for a geographic region defined for census purposes. We additionally add a column flagging these tracts as food deserts.
# find all low-income low-access areas
food_desert <- food_des_data[food_des_data$LILATracts_1And20 == 1 &
food_des_data$LILATracts_Vehicle == 1, ]
# keep census tract number and add a flag for food desert
food_desert <- food_desert[1]
food_desert$food_desert <- 1
Next, we find areas in the US that are not food deserts by keeping only rows where all the flags for low income and low access are set to 0. In a similar fashion as before, we keep track of these areas by keeping their CensusTract and we add a flag column for food deserts (this time set to 0 as these areas are not food deserts).
# find non-deserts (non low-income and non low-access)
non_desert <- food_des_data[food_des_data$LowIncomeTracts == 0 &
food_des_data$LA1and10 == 0 &
food_des_data$LAhalfand10 == 0 &
food_des_data$LA1and20 == 0 &
food_des_data$LATracts_half == 0 &
food_des_data$LATracts1== 0 &
food_des_data$LATracts10 == 0 &
food_des_data$LATracts20 == 0, ]
# keep census tract number and add a flag for food desert
non_desert <- non_desert[1]
non_desert$food_desert <- 0
The Yelp review API allows us to specify the location for the area we want to look at via longitude and latitude. Therefore, we now need to add longitude and latitude information to our food desert data (which currently only contains census tract IDs). We accomplish this by loading in a Census Tracts file downloaded from the US Census Bureau. We combine the longitudes and latitudes from that file with our food desert data frame by joining the frames on the tract ID. Following the join, we check to ensure there are no missing values in the final data set.
# combine food desert and non-desert frames
food_deserts <- rbind(food_desert[sample(nrow(food_desert), 600), ],
non_desert[sample(nrow(non_desert), 600), ])
# read in census tract data
census_tracts <- read.delim("2021_Gaz_tracts_national.txt")
census_tracts <- census_tracts %>% dplyr::rename(CensusTract=GEOID)
# add longitude/latitude data to food desert frame
food_deserts <- inner_join(food_deserts, census_tracts[c(2,7,8)], by='CensusTract')
# check to make sure there are no missing values
sum(is.na(food_deserts))
## [1] 0
Now that we have a list of locations considered food deserts and non-deserts, we can start requesting Yelp rating and review data.
We start off by setting up the API key and the data frames that will be used to store the ratings. We create separate data frames for grocery stores and fast food restaurants. Since we will request 5 stores per location, we multiply the total row count by 5.
# API and data frame setup
yelp_key <- Sys.getenv("YELP_API_KEY") # private API key, read from the environment rather than hard-coded into the report
num_rows <- nrow(food_deserts)
# creating empty data frames
grocery_general <- data.frame(matrix(ncol = 6, nrow = num_rows*5))
fastfood_general <- data.frame(matrix(ncol = 6, nrow = num_rows*5))
# renaming columns
cols <- c("store_name", "store_id", "review_count", "ratings", "categories",
"food_desert")
colnames(grocery_general) <- cols
colnames(fastfood_general) <- cols
The first set of API requests will be made for the Business Search endpoint. This endpoint returns a list of stores and general information about them based on the location that is provided. To fill in the data frames, we created a function that goes through our list of locations and requests 5 stores per location. We then store some basic information like the store ID, review count, and average rating for each store.
# this function returns a data frame with general store information based on requested locations and store type
get_general_ratings <- function(food_deserts, num_rows, store_general, store_type, yelp_key) {
for (row in 1:num_rows) {
# get longitude and latitude of location
lat <- food_deserts[row, "INTPTLAT"]
long <- food_deserts[row, "INTPTLONG"]
# request a list of 5 stores of specified type closest to the location
params <- list('latitude' = lat, 'longitude' = long, 'categories' = store_type,
'limit' = 5)
result <- GET("https://api.yelp.com/v3/businesses/search",
add_headers(Authorization = paste("Bearer", yelp_key)),
query = params)
# extract query results
res_JSON <- fromJSON(httr::content(result, type = "text", encoding = "UTF-8"))
# save parameters from the results to the data frame
total_found <- length(res_JSON$businesses$name)
if(total_found > 0){
for(store in (total_found-1):0){
store_general[row*5-store,"store_name"] <- res_JSON$businesses$name[store+1]
store_general[row*5-store,"store_id"] <- res_JSON$businesses$id[store+1]
store_general[row*5-store,"review_count"] <-
res_JSON$businesses$review_count[store+1]
store_general[row*5-store,"ratings"] <- res_JSON$businesses$rating[store+1]
categories <- res_JSON$businesses$categories[[store+1]]$alias
store_general[row*5-store,"categories"] <- paste(categories, collapse=",")
store_general[row*5-store,"food_desert"] <- food_deserts[row, "food_desert"]
}
}
}
return(store_general)
}
After creating the function, we make a call to get data on grocery stores. In the code below, the function call is commented out to avoid making all API requests every time the notebook is compiled (we are limited to only 5,000 requests per day). Instead, we load in a saved CSV file with the output from the first time the function was run.
After the data frame is filled in, we remove any rows which are duplicates (cases where two locations were close to each other and thus the same stores were selected) as well as any rows with missing data (no reviews).
# collect general data on grocery stores
#grocery_general <- get_general_ratings(food_deserts, num_rows, grocery_general,
# 'grocery', yelp_key)
#write.csv(grocery_general,"grocery_general_raw.csv", row.names = FALSE)
grocery_general <- read.csv('grocery_general_raw.csv')
# remove all rows with missing values and duplicates and save final data frame
grocery_general <- grocery_general[complete.cases(grocery_general), ]
grocery_general <- grocery_general[!duplicated(grocery_general),]
kable_styling(kable(head(grocery_general)))
| store_name | store_id | review_count | ratings | categories | food_desert |
|---|---|---|---|---|---|
| Jim’s Market | zXsu5PpGUHOnuFQiGbRGoQ | 1 | 4.0 | grocery | 1 |
| Dollar General | 2UI6RQ1nK9Bu0YrZbGaf-w | 2 | 1.0 | grocery,discountstore | 1 |
| Marion Walmart Neighborhood Market | 8dv3v6zieubimwOzNCa3pQ | 2 | 5.0 | deptstores,grocery | 1 |
| Walmart Supercenter | zDvg2Zpvcd2k5isyJbbSmQ | 20 | 2.0 | deptstores,grocery | 1 |
| Kroger | 5IOczs_0-7dFnGiOwyM4hg | 9 | 4.0 | grocery,bakeries,delis | 1 |
| The BE Hive | Yu_tXvUoyOj5SpeDUOZwnQ | 53 | 4.5 | gourmet,grocery,catering | 1 |
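The comment-out-and-reload workflow above can also be captured in a small cache-or-fetch helper. This is just a sketch of the pattern, with a hypothetical `fetch_fn` argument, not code used elsewhere in the project:

```r
# Sketch of the cache-or-fetch pattern used throughout this section:
# run the (slow, rate-limited) fetch only when no cached CSV exists.
# `fetch_fn` is a hypothetical zero-argument function returning a data frame.
fetch_or_load <- function(cache_file, fetch_fn) {
  if (file.exists(cache_file)) {
    read.csv(cache_file)             # reuse the cached copy
  } else {
    df <- fetch_fn()                 # hit the API (counts against the daily limit)
    write.csv(df, cache_file, row.names = FALSE)
    df
  }
}
```

With such a helper, a call like `fetch_or_load("grocery_general_raw.csv", function() get_general_ratings(...))` would replace the manual commenting and uncommenting of the API calls.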
We repeat the same process described above to get fast food restaurant data (category name for fast food is ‘hotdogs’ using the Yelp API).
# collect general data on fast food restaurants
#fastfood_general <- get_general_ratings(food_deserts, num_rows, fastfood_general,
# 'hotdogs', yelp_key)
#write.csv(fastfood_general,"fastfood_general_raw.csv", row.names = FALSE)
fastfood_general <- read.csv('fastfood_general_raw.csv')
# remove all rows with missing values and duplicates and save final data frame
fastfood_general <- fastfood_general[complete.cases(fastfood_general), ]
fastfood_general <- fastfood_general[!duplicated(fastfood_general),]
kable_styling(kable(head(fastfood_general)))
| store_name | store_id | review_count | ratings | categories | food_desert |
|---|---|---|---|---|---|
| Captain D’s | 59PZLJi5-SVfmeaJoXX0fw | 4 | 3.5 | seafood,hotdogs,tradamerican | 1 |
| Popeyes Louisiana Kitchen | 4cMH1c2dPcyEjCoAU-EndQ | 8 | 3.0 | chicken_wings,hotdogs | 1 |
| Wendy’s | 9TWVAP1uubHxcr4wveXY6w | 9 | 3.0 | hotdogs,burgers | 1 |
| Chick-fil-A | fTm-hODqvLVtIWZYRGsHNQ | 20 | 2.5 | hotdogs,chickenshop | 1 |
| The Port | M4OtgdAd60ewzattFh6CTQ | 10 | 4.5 | tradamerican,burgers,hotdogs | 1 |
| Grillshack Fries and Burgers | luS7HvNPlO3lyLkG40C_eA | 43 | 5.0 | burgers,chicken_wings,hotdogs | 1 |
We now have all the necessary general data for the requested locations and can move to requesting actual customer reviews. For this part, we use the Reviews endpoint of the Yelp API.
We once again start by creating empty data frames to store the reviews. Since Yelp limits the response to 3 reviews per store, we create 3 columns to store each of the reviews, and 3 to store each of the ratings. We additionally copy over the store IDs and food desert status from previous data frames.
# setting up data frame to collect reviews
grocery_reviews <- data.frame(matrix(ncol = 8, nrow = nrow(grocery_general)))
fastfood_reviews <- data.frame(matrix(ncol = 8, nrow = nrow(fastfood_general)))
# renaming columns
cols <- c("store_id", "review_1", "review_2", "review_3", "rating_1",
"rating_2", "rating_3", "food_desert")
colnames(grocery_reviews) <- cols
colnames(fastfood_reviews) <- cols
# adding store ID and food desert status to data frame
grocery_reviews$store_id <- grocery_general$store_id
fastfood_reviews$store_id <- fastfood_general$store_id
grocery_reviews$food_desert <- grocery_general$food_desert
fastfood_reviews$food_desert <- fastfood_general$food_desert
To fill in the data frames, we created another function that goes through the list of stores and gets the reviews and ratings.
# this function returns a data frame with store ratings and reviews
get_reviews <- function(review_df, yelp_key) {
for (row in 1:nrow(review_df)) {
# get store ID and set up URL
business_id <- review_df[row, "store_id"]
url <- paste0("https://api.yelp.com/v3/businesses/", business_id, "/reviews")
# request reviews for store & extract text
result <- GET(url, add_headers(Authorization = paste("Bearer", yelp_key)))
if(result$status_code == 200){
res_JSON <- fromJSON(httr::content(result, type = "text", encoding = "UTF-8"))
# save reviews to the data frame; stores with failed requests are skipped
total_found <- length(res_JSON$reviews$text)
if(total_found > 0){
for(review in 1:total_found){
review_df[row,paste("review_",review,sep="")] = res_JSON$reviews$text[review]
review_df[row,paste("rating_",review,sep="")] = res_JSON$reviews$rating[review]
}
}
}
}
return(review_df)
}
To complete the data collection process, we remove any duplicate rows and save the final files. Once again, the function calls are commented out to avoid long waits and running over the API limit; a previously saved version of the raw data is loaded in instead.
#grocery_reviews <- get_reviews(grocery_reviews, yelp_key)
grocery_reviews <- read.csv('grocery_reviews_raw.csv')
grocery_reviews <- grocery_reviews[!duplicated(grocery_reviews),]
kable_styling(kable(head(grocery_reviews)))
| store_id | review_1 | review_2 | review_3 | rating_1 | rating_2 | rating_3 | food_desert |
|---|---|---|---|---|---|---|---|
| zXsu5PpGUHOnuFQiGbRGoQ | Another nice corner store in West Memphis Arkansas. I stop here everyday before I go to work to pick up one thing that I need. They are always packed with… | NA | NA | 4 | NA | NA | 1 |
| 2UI6RQ1nK9Bu0YrZbGaf-w | The girl that calls herself the manager is rude and a smart aleck. The store is filthy. There are boxes always in the isles that make maneuvering the… | My shopping experience today was lacking professional behavior and respect from the employee/manager. I shopped around the store and used the store’s app… | NA | 1 | 1 | NA | 1 |
| 8dv3v6zieubimwOzNCa3pQ | I love this location. All of the staff is super friendly! This location is on top on its cleanliness. Although it is smaller than the super center down the… | Love this location. Staff is friendly. Shelves always stocked and a lot cleaner then your normal Walmart grocery. | NA | 5 | 5 | NA | 1 |
| zDvg2Zpvcd2k5isyJbbSmQ |
This Walmart is probably safer than most Walmarts in Memphis. Your greatest chance of violence here is being bitten by mosquitoes. Do not shop here on… |
What a beautiful establishment. They have a very reasonable selection of meats but can be expensive sometimes depending on what’s available in stock. Can… | This rating goes for most all the Walmart stores in the Memphis area. We have called multiple stores with similar results. The store in West Memphis was the… | 4 | 3 | 1 | 1 |
| 5IOczs_0-7dFnGiOwyM4hg | I love Kroger I can always get a riding cart because I am handicapped they have weekly sales, they keep fresh meat and the workers are so helpful and the… |
I want one of these in Alabama!!!!!!! Great store. Great prices. Floored at the produce variety. Friendly cashier. I throughly enjoyed my walk through the… |
For a food store it was a little on the dirty side. I would expect it to be maintained a little better. | 5 | 5 | 2 | 1 |
| Yu_tXvUoyOj5SpeDUOZwnQ |
SO GOOD! I’m not even vegan or vegetarian but this food slapped! I got the Nashville hot chicken sandwich and my boyfriend got the crunch wrap. The… |
This is a vegan deli, which is an odd new trend (there’s one in Cincinnati, and a few in more coastal cities). It just seems like such a contradiction, and… | Oh my goodness, sooooooo good! We just stumbled onto the place, thought there was no way all the products could be plant-base and taste delicious, and we… | 5 | 4 | 5 | 1 |
#fastfood_reviews <- get_reviews(fastfood_reviews, yelp_key)
fastfood_reviews <- read.csv('fastfood_reviews_raw.csv')
fastfood_reviews <- fastfood_reviews[!duplicated(fastfood_reviews),]
kable_styling(kable(head(fastfood_reviews)))
| store_id | review_1 | review_2 | review_3 | rating_1 | rating_2 | rating_3 | food_desert |
|---|---|---|---|---|---|---|---|
| 59PZLJi5-SVfmeaJoXX0fw | Probably one of the better options for fast food in West Memphis. The restaurant is very clean and looks like it has been remodeled some time recently. The… | Fast food “seafood” is never a good idea & this place reminds you of that. I mean if your just so broke you can’t afford the real deal but really craving… | Clean fast food but I thought the fish was way too greasy. The shrimp was also overcooked. | 5 | 2 | 2 | 1 |
| 4cMH1c2dPcyEjCoAU-EndQ | 1344 w. Missouri, West Memphis TN, dang! Y’all gotta do better! It’s 1pm on May 1. Okay the Popeyes sign says “Lobby is Open” I park my car and the doors… | Popeyes is one of my favorite places to go on my lunch break. I go with the cult classic chicken sandwich most of the time and it’s never been bad from this… | This is the most ridiculous ran fast food I’ve ever been to! the wait in the drive thru line is 15-30 mins & when you get your food it’s NEVER accurate! ask… | 2 | 4 | 1 | 1 |
| 9TWVAP1uubHxcr4wveXY6w | Wendy’s is not the old Wendy it sucks now poor customer service and the food is cold they forgot to give me a whole order that I paid for and they won’t… | Kwanza took our order and she is a sweetheart. I was asking questions about a limited time special they were running, and wanted to know if it was worth the… | This Wendy’s was very busy but we got our food pretty fast! The lady that took our order was short but nice, she gets 4 stars from me! The reasons I give it… | 1 | 5 | 2 | 1 |
| fTm-hODqvLVtIWZYRGsHNQ | They are really very good at serving delicious food, and usually accomplishing it rather quickly. It’s nice to receive true “fast food”, and I have always… | What a wonderful false advertised photo! Great salad but it doesn’t come with the coke you see on the sign behind the counter. It’s advertised as a meal… | I was excited to see a Chick-fil-A, but super disappointed when my salad didn’t have any chicken in it. I should’ve check my order before I left the drive… | 4 | 2 | 2 | 1 |
| M4OtgdAd60ewzattFh6CTQ | Had the catfish dinner with fries and i was completely satisfied. Came with fries and a salad. 2 big fillets and nice amount of fries. The fish was seasoned… | My coworkers and I are from out of town and doing some work close by. We happen to find The Port by chance by looking on line and tried it for lunch our… | Great service and delicious food! I have never made it to the buffet at lunch but the sandwiches and burgers are great! | 5 | 5 | 5 | 1 |
| luS7HvNPlO3lyLkG40C_eA | I was craving burgers, and more importantly some good fries. In a city with so much good fried chicken, you would think that the humble potato would get the… | Today I was craving burgers, but I didn’t want to go to a place I’ve visited before. So I opened up my Yelp app and started browsing for burger joins that… | Five stars for the fries alone. Didn’t try anything else on the menu, but if you love thick, delicious French fries hot and straight out of the fryer, make… | 5 | 4 | 5 | 1 |
This completes our data collection process for Yelp reviews and ratings.
To analyze the Yelp data, we take two approaches: first, we compare the overall star ratings and review counts between food deserts and non-deserts; second, we perform sentiment analysis on the review text.
We first look at overall grocery store ratings for food deserts vs. non-deserts. We construct a histogram to compare rating distributions.
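The plotting code itself is suppressed in this report; a minimal sketch of such a comparison, using a small illustrative stand-in for the grocery_general frame (the column names match the frame built earlier), might look like the following:

```r
library(ggplot2)

# illustrative stand-in for grocery_general (the real frame is built above)
ratings_df <- data.frame(
  ratings     = c(4.0, 1.0, 5.0, 2.0, 4.0, 4.5, 3.0, 3.5),
  food_desert = c(1, 1, 1, 1, 0, 0, 0, 0)
)

# per-group mean ratings, drawn as dashed vertical lines
rating_means <- aggregate(ratings ~ food_desert, data = ratings_df, FUN = mean)

p <- ggplot(ratings_df, aes(x = ratings, fill = factor(food_desert))) +
  geom_histogram(binwidth = 0.5, position = "dodge") +
  geom_vline(data = rating_means, inherit.aes = FALSE,
             aes(xintercept = ratings, colour = factor(food_desert)),
             linetype = "dashed") +
  labs(x = "Yelp rating", fill = "Food desert", colour = "Food desert")
```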
The histogram shows that grocery store ratings are similarly distributed between food deserts and non-deserts, with most stores getting a rating of around 4. However, the average rating, shown by the dashed line, does tend to be slightly lower for areas which are not food deserts. Review counts also have a similar distribution between the two areas, with most stores getting very few reviews.
To formally test whether ratings differ between the two groups, we conduct a Mann-Whitney U (Wilcoxon rank-sum) test, since the data are skewed and not normally distributed. The null hypothesis is that the rating distributions of the two groups (food desert and non-desert) have the same location.
wilcox.test(ratings~ food_desert, data = grocery_general, exact = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: ratings by food_desert
## W = 2092864, p-value = 0.0004698
## alternative hypothesis: true location shift is not equal to 0
Since the p-value is < 0.05, we reject the null hypothesis and conclude that grocery store ratings in food desert areas are higher, on average, than in non-desert areas. We do not compare review counts, as the number of reviews is largely driven by population size, which we do not control for.
We repeat the same process for fast food ratings.
Once again, we notice that although the general distribution of ratings is similar between the areas, ratings for fast food in food desert areas are slightly higher. Review counts also have a similar distribution between the two areas, with most stores getting very few reviews. We repeat the Mann-Whitney U test on the ratings.
# test if the means are equal
wilcox.test(ratings~ food_desert, data = fastfood_general, exact = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: ratings by food_desert
## W = 1985642, p-value = 0.4344
## alternative hypothesis: true location shift is not equal to 0
This time the p-value (0.4344) is greater than 0.05, so we fail to reject the null hypothesis: there is no significant difference in fast food ratings between food desert and non-desert areas.
In general, then, grocery store ratings are significantly higher in food desert areas, while fast food ratings do not differ significantly between the two types of areas.
Note: this section was completed with the help of the r-bloggers tutorial on tidy sentiment analysis of Yelp reviews (see Programming References).
To analyze review data, we first create data frames that stack all reviews together (1 row per review as opposed to 1 business and 3 reviews per row).
# this function creates a data frame with all reviews, unique IDs for each review, and their rating (out of 5 stars)
unstack_reviews <- function(review_df){
# unstack reviews and ratings
reviews_text <- data.frame(review_df[8], stack(review_df[2:4]))
reviews_stars <- data.frame(review_df[8], stack(review_df[5:7]))
# combine in a new data frame
final_reviews <- data.frame (food_desert = reviews_text$food_desert,
text = reviews_text$values,
stars = reviews_stars$values)
# add a unique ID to each review
final_reviews$review_id <- 1:length(final_reviews$text)
return(final_reviews)
}
# get review data frames for each category
final_grocery <- unstack_reviews(grocery_reviews)
## Warning in data.frame(review_df[8], stack(review_df[2:4])): row names were found
## from a short variable and have been discarded
## Warning in data.frame(review_df[8], stack(review_df[5:7])): row names were found
## from a short variable and have been discarded
final_fastfood <- unstack_reviews(fastfood_reviews)
## Warning in data.frame(review_df[8], stack(review_df[2:4])): row names were found
## from a short variable and have been discarded
## Warning in data.frame(review_df[8], stack(review_df[2:4])): row names were found
## from a short variable and have been discarded
# join all reviews together
final_grocery$fast_food <- 0
final_fastfood$fast_food <- 1
all_reviews <- rbind(final_grocery,final_fastfood)
# create categories for location and type of store
all_reviews <- all_reviews %>%
mutate(category = case_when(
(food_desert == 1 & fast_food == 1) ~ 'food desert, fast food',
(food_desert == 1 & fast_food == 0) ~ 'food desert, grocery',
(food_desert == 0 & fast_food == 1) ~ 'non-desert, fast food',
(food_desert == 0 & fast_food == 0) ~ 'non-desert, grocery'))
To perform sentiment analysis, we will be using the AFINN lexicon from the tidytext library. This will give us words scored based on sentiment on a scale of -5 (negative sentiment) to +5 (positive sentiment).
# set up AFINN lexicon for sentiment analysis
afinn <- get_sentiments("afinn") %>%
mutate(lexicon = "afinn", words_in_lexicon = n_distinct(word)) %>%
dplyr::select(word, afinn_score = value)
kable_styling(kable(head(afinn)))
| word | afinn_score |
|---|---|
| abandon | -2 |
| abandoned | -2 |
| abandons | -2 |
| abducted | -2 |
| abduction | -2 |
| abductions | -2 |
We first turn each review into a one-row-per-term document and perform some basic pre-processing steps including tokenization, removing stop words, and removing formatting symbols. We assess the sentiment of words used in each review by joining them with the AFINN lexicon. We get the overall sentiment score for each individual review by averaging the sentiment of the existing terms within the review.
# split terms into rows and clean up
review_words <- all_reviews %>%
dplyr::select(review_id, text, stars, food_desert, fast_food, category) %>%
unnest_tokens(word, text) %>%
filter(!word %in% stop_words$word, str_detect(word, "^[a-z']+$"))
# average sentiment ratings per review to get overall sentiment
all_sentiments <- review_words %>%
inner_join(afinn, by = "word") %>%
dplyr::group_by(review_id, stars, food_desert, fast_food, category) %>%
dplyr::summarise_at(vars(afinn_score), list(sentiment = mean))
After getting sentiment scores, we want to ensure that the sentiment generally follows the same trend as the star ratings. In other words, stores and restaurants rated 4-5 stars should have higher average sentiment scores than those rated 1-2 stars; sentiment should increase with the rating. We check this by generating box-and-whisker plots.
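The plot itself is suppressed in this report; a sketch of the sanity check, run on a small illustrative stand-in for the all_sentiments frame built above, could be:

```r
library(ggplot2)

# illustrative stand-in for all_sentiments (built above from the AFINN join)
sent_df <- data.frame(
  stars     = c(1, 1, 2, 3, 3, 4, 4, 5, 5, 5),
  sentiment = c(-2.1, -1.5, -0.4, 0.2, 0.8, 1.1, 1.6, 2.0, 2.4, 1.8)
)

# sentiment should trend upward with the star rating
p <- ggplot(sent_df, aes(x = factor(stars), y = sentiment)) +
  geom_boxplot() +
  labs(x = "Star rating", y = "Average sentiment score")
```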
As expected, the average sentiment score increases as store ratings increase. Each category contains outliers and some 'incorrectly' scored reviews, but on average the scores agree with expectations.
Comparing Sentiment Scores Between Categories
Since the general validity of the sentiment scores has been confirmed, we can now compare the average sentiment scores between food deserts and non-deserts, as well as grocery stores and fast food restaurants.
ggplot(all_sentiments, aes(category, sentiment, group = category)) +
geom_boxplot() + ggtitle("Sentiment Ratings by Category") +
ylab("Average Sentiment Score") + xlab('Category')
Based on the generated plot, fast food sentiment scores seem fairly consistent between food desert and non-desert areas. Non-deserts appear to have a slightly lower minimum average sentiment score, but the distributions are generally the same. Grocery stores, on the other hand, seem to have slightly higher sentiment scores in non-deserts, though the medians appear fairly equal between food deserts and non-deserts. Lastly, grocery stores seem to score higher across the board in both area types, with higher averages and higher Q1 and Q3 values compared to fast food restaurants.
We formally test the difference between all the categories.
wilcox.test(sentiment~ food_desert, data =
all_sentiments[all_sentiments$fast_food == FALSE, ], exact = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: sentiment by food_desert
## W = 9020148, p-value = 0.000685
## alternative hypothesis: true location shift is not equal to 0
There is a significant difference in mean sentiment scores between grocery stores in food desert areas vs. non-desert areas. The mean sentiment score is higher for grocery stores in non-desert areas than the mean sentiment score for grocery stores in food deserts.
wilcox.test(sentiment~ food_desert, data =
all_sentiments[all_sentiments$fast_food == TRUE, ], exact = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: sentiment by food_desert
## W = 9557939, p-value = 0.1963
## alternative hypothesis: true location shift is not equal to 0
There is no significant difference in mean sentiment scores between fast food restaurants in food desert vs. non-desert areas.
wilcox.test(sentiment~ fast_food, data =
all_sentiments[all_sentiments$food_desert == FALSE, ], exact = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: sentiment by fast_food
## W = 11505642, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
There is a significant difference in mean sentiment scores between grocery stores and fast food restaurants in non-desert areas. The mean sentiment score is higher for grocery stores than it is for fast food in non-desert areas.
wilcox.test(sentiment~ fast_food, data =
all_sentiments[all_sentiments$food_desert == TRUE, ], exact = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: sentiment by fast_food
## W = 9434905, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
There is a significant difference in mean sentiment scores between grocery stores and fast food restaurants in food desert areas. The mean sentiment score is higher for grocery stores than it is for fast food in food desert areas.
Most Positive and Negative Words
As a last step, we look at which positive and negative words appear most often in reviews for food desert areas vs. non-desert areas.
get_world_cloud <- function(review_words, category, num_reviews){
# select category of reviews
review_words <- review_words[review_words$category == category, ]
# ungroup and count how many times words appear in review
words_counted <- review_words %>%
dplyr::count(review_id, stars, category, word) %>%
ungroup()
# count occurrences of each word and get mean rating per word
word_summaries <- words_counted %>%
group_by(word) %>%
dplyr::summarize(reviews = n(),
uses = sum(n),
average_stars = mean(stars)) %>%
ungroup()
# get only words used in over num_reviews # of reviews
top_words <- word_summaries %>%
filter(reviews >= num_reviews) %>%
arrange(desc(average_stars))
# create word cloud
wordcloud(words = top_words$word, freq = top_words$uses, min.freq = 1,
max.words=50, random.order=FALSE, rot.per=0.35)
}
Get POSITIVE word cloud for grocery stores in food deserts:
par(mfrow=c(1,1))
get_world_cloud(review_words, 'food desert, grocery', 80)
Get POSITIVE word cloud for grocery stores in non-deserts:
get_world_cloud(review_words, 'non-desert, grocery', 80)
Based on the word clouds, there does not seem to be a large difference in positive reviews between grocery stores in the two areas. Both clouds contain words like variety, selection, and quality. It appears that grocery stores in food deserts are of good quality; the main problem may be that they are simply not as accessible as grocery stores in non-desert areas.
The preceding analysis reveals some undeniable patterns in how the number of grocery stores relates to people's health, especially obesity.
Across the statistical techniques employed in this project, income was the most important variable for predicting obesity in every analysis, regardless of the technique used; this is a pattern we had already identified in the descriptive statistics at the beginning of the project. The unsupervised machine learning algorithms showed the same pattern linking zip codes with a higher number of grocery stores to lower obesity, which can be connected to the fact that higher income is also related to a higher number of grocery stores.
Our results also depend on location: in the clustering analysis, for example, there is a clear difference between the state of New York and other states. New York has a higher number of grocery stores per square mile, higher income, and fewer health problems like obesity and cancer. Analyzing a specific state (Wyoming), we found broadly similar results, though some relationships, such as the effect of education, appeared more clearly.
Once we moved to supervised machine learning, new variables emerged as highly correlated with obesity, especially education and the proportion of black inhabitants in a zip code area. Store count per square mile has a negative association with obesity once health and demographic variables are controlled for.
As a general observation, since we only have observational data we cannot claim causality, but this project does help us understand which variables are related to health problems like obesity, and how those problems relate to food deserts.
Using the Yelp API, we examined whether there is a significant difference between ratings of grocery stores in areas considered food deserts and areas not categorized as such. We found a statistically significant difference in mean rating, with grocery stores scoring higher in food deserts; for fast food restaurants, however, the difference was not significant. These trends were cross-checked with a sentiment analysis.
It is clear that food deserts in the United States are a serious problem that should be addressed, especially as the problem disproportionately affects vulnerable populations like minorities. In future studies, a more detailed state-by-state analysis could show whether the same patterns appear.
We have found a negative correlation between obesity and the number of grocery stores per square mile. This is consistent with the existence of "food deserts" and their negative impact on a community. Further, this lack of quality food and the prevalence of obesity are greater in areas with lower income, more black residents, and lower education levels. It would seem, though more study is needed, that these findings may extend to other health conditions, such as chronic heart disease.
General References
Finlay, Jessica, Li, Mao, Esposito, Michael, Gomez-Lopez, Iris, Khan, Anam, Clarke, Philippa, and Chenoweth, Megan. National Neighborhood Data Archive (NaNDA): Grocery Stores by ZIP Code Tabulation Area, United States, 2003-2017. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2020-10-01. https://doi.org/10.3886/E123042V1
Rhone, Alana, Ver Ploeg, Michele, Dicken, Chris, Williams, Ryan, and Breneman, Vince.
Low-Income and Low-Supermarket-Access Census Tracts, 2010-2015, EIB-165, U.S.
Department of Agriculture, Economic Research Service, January 2017.
Programming References
https://albertostefanelli.github.io/SEM_labs/KUL/labs/html_lab/SEM_lab_1.html
https://thatdatatho.com/easily-create-descriptive-summary-statistic-tables-r-studio/
https://www.r-bloggers.com/2016/07/does-sentiment-analysis-work-a-tidy-analysis-of-yelp-reviews/